Spiders and crawlers and scrapers, oh my! The law and ethics of researchers violating terms of service

Casey Fiesler
Apr 3, 2020
[Image: a “data spider” crawling its way across data scraping provisions from a number of websites]

Last week, a federal court ruled that violating a website’s terms of service (TOS) in order to conduct research aimed at uncovering discriminatory algorithms does not violate the U.S. Computer Fraud and Abuse Act (CFAA). Coincidentally, last week I also finalized the publication of a paper forthcoming at ICWSM 2020* (an AAAI social computing conference), based on a study conducted with my collaborators Nathan Beard (now a PhD student at UMD) and Brian Keegan: “No Robots, Spiders, or Scrapers: Legal and Ethical Regulation of Data Collection Methods in Social Media Terms of Service.”

My research as part of the PERVADE (Pervasive Data Ethics for Computational Research) project covers the ethical implications of research that involves collecting publicly available data, like tweets or Reddit posts. And when you ask people whether it’s “okay” to collect data from a website, two questions commonly come up: (1) Is the data “public”? and (2) Does collecting it violate the TOS?

I won’t address the first question here (though it is absolutely not a sufficient question, and you can read a lot more about my thoughts on that in a different article), and instead focus on the second. Whether it is an ethical violation for researchers to violate TOS — particularly data collection/scraping provisions — has been a topic of debate for years. In fact, two of the plaintiffs in the CFAA case, Karrie Karahalios and Christian Sandvig, were authors on a position paper on this topic for the very first research ethics workshop I organized at CSCW in 2015. The SIGCHI ethics committee (of which I am a member) also published a blog post grappling with the issue in 2017, and I have previously blogged here about it following the initial hiQ v. LinkedIn decision.

But after thinking about this sticky issue for a while, Nathan, Brian, and I wondered: what are researchers actually violating, anyway? What do these provisions in TOS even look like, and what is the relationship between law and ethics here? After all, law and ethics are not the same thing, and even if it were clear that a law is being broken (which it isn’t, thanks to these court cases and more general ambiguity in other areas), that wouldn’t necessarily settle the ethics of the act. If you jaywalk in order to push someone out of the way of a speeding car, I doubt many would argue that’s an unethical act. (Similarly, I would argue that the court decision in Sandvig v. Barr is the appropriate one, ethically — rooting out discrimination could be seen as an ethical imperative that trumps what amounts to TOS jaywalking.)

For this study, we qualitatively analyzed the data collection provisions from 116 social media sites (seeded by this list as it appeared in late 2017), and identified non-mutually-exclusive categories of provision types (e.g., “no automated collection” versus “no manual collection”) as well as themes around context. There is a lot more detail in the paper about both methods and findings (including a giant table of categorized sites, which should not be taken as current since TOS change frequently and this data is three years old), but here are the major takeaways:

(1) Data collection provisions are extremely vague and highly inconsistent across sites. This isn’t surprising; I’ve found the same in analyses of both copyright licenses and harassment policies in TOS. But it means that it’s very difficult to make decisions based on TOS when, honestly, they’re not designed to be understood in the first place. If you’re scratching your head and wondering what constitutes “personal” data that you’re not supposed to collect — who knows? (For a sense of what this kind of provision language looks like, see the toy sketch after this list.)

(2) Data collection provisions are almost entirely lacking in context. With the exception of a few vague mentions of “personal” data, and one platform that specifically prohibits academic research, most data collection provisions are context-agnostic when it comes to, for example, what specific data is being collected, who is collecting it, what it’s being used for, and what the expectations and potential harms might be for the people who created it. This is important because these are the things that should matter most for an ethical analysis!

(3) The problem with a TOS-based decision about collecting data is that it assumes (a) that violating TOS is inherently unethical; and (b) that violating TOS is the only thing that could make collecting data unethical. I suggest that neither of these is true. There might be situations in which violating TOS against the wishes of a company could be an ethical act — for example, uncovering discrimination. There’s also the fact that if research on a platform can only be conducted by researchers explicitly given access to that platform, this might skew scientific discovery, particularly if those researchers are constrained by the company that employs them. And there are many sites that don’t prohibit data collection at all, but where the research might still have ethical problems for any number of reasons. Just because something might be legal doesn’t mean that it’s fair game ethically; even well-intentioned research or applications might be harmful, regardless of legality.

(4) If you are making decisions about collecting and using data created by people, context is critical, and sometimes there aren’t rules to follow like there are in law. Research ethics isn’t so much about following rules as it is about being thoughtful and carefully examining each situation contextually. For example, user expectations might be an ethical consideration, but TOS are unlikely to be a good proxy for those expectations (since we know that almost no one reads them); better ways to determine expectations might include seeking guidance from gatekeepers or members of the community.
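To make the vagueness point in (1) a bit more concrete, here is a minimal, purely illustrative Python sketch of the kind of provision language such an analysis has to read for. To be clear, this is not how our study worked — we relied on qualitative human coding, not keyword matching — and the category names and regular expressions below are hypothetical, only loosely echoing the paper’s non-mutually-exclusive provision types.

```python
import re

# Hypothetical provision "types" and the kind of language that signals them.
# These patterns are illustrative only; the paper's categories came from
# qualitative coding of 116 sites' TOS, not from keyword matching.
PATTERNS = {
    "no automated collection": re.compile(
        r"\b(robots?|spiders?|scrapers?|crawlers?|automated)\b", re.I),
    "no manual collection": re.compile(
        r"\bmanual(ly)?\s+(collect|copy|harvest)", re.I),
    "personal data restriction": re.compile(
        r"\bpersonal (data|information)\b", re.I),
}

def flag_provisions(tos_excerpt: str) -> list[str]:
    """Return the (toy) provision categories whose language appears in the excerpt."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(tos_excerpt)]

if __name__ == "__main__":
    excerpt = ("You may not use robots, spiders, or scrapers, or otherwise "
               "collect personal information about other users.")
    print(flag_provisions(excerpt))
    # -> ['no automated collection', 'personal data restriction']
```

Even a perfect flagging pass like this only surfaces the language; it cannot tell you what “personal” actually means, who the provision is aimed at, or whether honoring or violating it is the more ethical choice. That is exactly the contextual work described in (4).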

That kind of contextual care involves more work than following a clear set of guidelines would. And I suspect that a lot of readers may find our conclusions frustrating (as some of the folks who reviewed this paper over a few rounds did!), because they do not include a clear answer as to whether violating TOS is unethical or not. However, ethics by its very nature rarely has right or wrong answers, only answers that are culturally and normatively informed.

Our goal with this work was to provide some additional context and analysis for researchers who are making these decisions. The paper itself has a lot more detail about the legal landscape (which I don’t think has been derailed by the recent court decision!) that might also be useful for a risk-based analysis. I hope that this will be helpful for researchers, and as always I think that one of the most important things you can do is to include information about your ethical considerations and reasoning in the methods section of your paper. Importantly, I do not think that our paper should be used to support a blanket statement that violating TOS is never an ethical problem, but instead to support the kinds of ethical considerations (beyond TOS) that you should actually be making when deciding whether and how to collect, use, and report public data.

Fiesler, Casey, Nathan Beard, and Brian C. Keegan. “No Robots, Spiders, or Scrapers: Legal and Ethical Regulation of Data Collection Methods in Social Media Terms of Service.” In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), 2020. [This work was supported by NSF Award #1704303 as part of the PERVADE (Pervasive Data Ethics for Computational Research) collaborative research project.]

* Sadly the conference has been canceled due to COVID-19, but I really hope that this work finds an audience anyway!

