Knowledge Base entry

What are some ethical concerns when scraping or mining Reddit data?

A practical answer page built from the knowledge base source.

Reddit hosts hundreds of millions of posts and comments written by individuals who typically did not anticipate that their words would be collected, analyzed, aggregated, or used in ways beyond the original conversational context. The ethical concerns around mining this data are genuine and have become more visible as AI training, academic research, and commercial analytics have expanded the scale and sophistication of Reddit data use. Privacy is the foundational concern. Although Reddit posts are technically public, the pseudonymous nature of the platform creates reasonable expectations of contextual privacy — users sharing in niche health, mental health, addiction recovery, or personal relationship subreddits do so with the implicit understanding that they are addressing a specific community, not contributing to a publicly available dataset. Aggregating these posts to build profiles of individual users, infer their real identities, or analyze their health or psychological states violates the contextual norms under which the content was shared. A 2024 systematic review of research using Reddit data identified privacy around sensitive topics as the most prevalent ethical challenge in the existing literature. Consent is a related concern with no clean resolution. Reddit's terms of service permit broad use of publicly available content, but legal permission and ethical propriety are not the same thing. Academic research ethics guidelines, including the Association for Internet Research's standards, generally require that research involving identifiable individuals — even using pseudonymous accounts — should obtain consent when the research involves sensitive topics, when re-identification is plausible, or when the research manipulates the community rather than observing it. The 2025 case of University of Zürich researchers who used AI-generated content to participate in r/changemyview without disclosure drew significant criticism precisely because it violated the community's norms and operated without informed consent. Data repurposing raises additional concerns. Reddit data collected under one stated purpose — academic research — being subsequently used for commercial applications, AI training datasets, or surveillance creates downstream harms that the original data subjects had no opportunity to consent to or contest. Responsible data practice requires specificity about the intended use and restraint about repurposing data beyond the original scope. Complying with Reddit's API terms, which now restrict use of data for AI training without a commercial agreement, is both a legal obligation and an ethical baseline that researchers and developers should treat as a floor, not a ceiling.