There is some confusion about ‘scraping’, what it is, whether it is legal and how it can be used. ‘Scraping’ of the web is just automated access to websites and it is lawful. Legal departments know this, which is why some of the largest companies in the world use Import.io to convert the web into structured data for use in their businesses.
I helped with a blog post published yesterday by Jamie Williams of the EFF as part of its Coders’ Rights Project, which works to protect programmers and developers engaged in cutting-edge exploration of technology from badly drafted computer crime laws.
The piece is aimed at legal tech people and is intended to contribute to the legal debate about automated access to websites (especially with a view to informing the judges ruling on the hiQ v. LinkedIn case), so it is a bit specialized.
But it nicely clarifies some general points about the importance of data from the web and the opportunity it represents, and it explains how everyone does it, even (and especially) the people who don’t want it done to them.
Legal departments know this, and they sign off on the use of Import.io technology all day, every day, which is why we have some of the biggest companies in the world as our customers.
However, it is an evolving space, and a small number of cases continue to come to court in this area, which is why it is important to keep contributing to the debate through organizations like the EFF.
Getting this kind of clear and accurate information out there is also an important riposte to the false, confused and misleading content coming from some other organizations who try to conflate talk of automated access to websites with “hacking”.
The article can be found on the EFF website.
‘Scraping’ is Just Automated Access, and Everyone Does It
By Jamie Williams | April 17, 2018
For tech lawyers, one of the hottest questions this year is: can companies use the Computer Fraud and Abuse Act (CFAA)—an imprecise and outdated criminal anti-“hacking” statute intended to target computer break-ins—to block their competitors from accessing publicly available information on their websites? The answer to this question has wide-ranging implications for everyone: it could impact the public’s ability to meaningfully access publicly available information on the open web. This will impede investigative journalism and research. And in a world of algorithms and artificial intelligence, lack of access to data is a barrier to product innovation, and blocking access to data means blocking any chance for meaningful competition.
The CFAA was enacted in 1986, when there were only about 2,000 computers connected to the Internet. The law makes it a crime to access a computer connected to the Internet “without authorization” but fails to explain what this means. It was passed with the aim of outlawing computer break-ins, but has since metastasized in some jurisdictions into a tool to enforce computer use policies, like terms of service, which no one reads.
Efforts to use the CFAA to threaten competitors increased in 2016 following the Ninth Circuit’s poorly reasoned Facebook v. Power Ventures decision. The case involved a dispute between Facebook and a social media aggregator, which Facebook users had voluntarily signed up for. Facebook did not want its users engaging with this service, so it sent Power Ventures a cease and desist letter and tried to block Power Ventures’ IP address. The Ninth Circuit found that Power Ventures had violated the CFAA by continuing to provide its services after receiving the cease and desist letter and having one of its IP addresses blocked.
After the decision was issued, companies—almost immediately—started citing the case in cease and desist letters, demanding that competitors stop using automated methods to access publicly available information on their websites. Some of these disputes have made their way to court, the most high profile of which is hiQ v. LinkedIn, which involves automated access of publicly available LinkedIn data. As law professor Orin Kerr has explained, posting information on the web and then telling someone they are not authorized to access it is “like publishing a newspaper but then forbidding someone to read it.”
The web is the largest, ever-growing data source on the planet. It’s a critical resource for journalists, academics, businesses, and everyday people alike. But meaningful access sometimes requires the assistance of technology to automate and expedite an otherwise tedious process of accessing, collecting, and analyzing public information. This process of using a computer to automatically load and read the pages of a website for later analysis is often referred to as “web scraping.”
As a technical matter, web scraping is simply machine-automated web browsing. There is nothing that can be done with a web scraper that cannot be done by a human with a web browser. And it is important to understand that web scraping is a widely used method of interacting with the content on the web: everyone does it, even (and especially) the companies trying to convince courts to punish others for the same behavior.
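To make "automatically load and read the pages of a website" concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not how Import.io or any company named in this article actually works: the names `PageReader`, `read_page`, `fetch`, the `example-reader/0.1` user agent, and the sample page are all hypothetical. The program makes the same kind of HTTP request a browser makes, then reads the title and links out of the returned HTML, which is what a person would do by eye.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class PageReader(HTMLParser):
    """Collects the page title and every link target found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; keep only real hrefs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def read_page(html):
    """'Read' a page: return its title and the links it contains."""
    reader = PageReader()
    reader.feed(html)
    reader.close()
    return {"title": reader.title.strip(), "links": reader.links}


def fetch(url, user_agent="example-reader/0.1"):
    """'Load' a public page over HTTP, as a browser would (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A made-up page standing in for any publicly available web page.
SAMPLE = """<html><head><title>Widget prices</title></head>
<body><a href="/widgets/1">Widget 1</a> <a href="/widgets/2">Widget 2</a></body></html>"""

if __name__ == "__main__":
    # For a live page one would call read_page(fetch(some_public_url)) instead.
    print(read_page(SAMPLE))
```

Nothing here goes beyond what a browser does when a person opens a page and looks at it; the only difference is that the reading step is performed by a program.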
Companies use automated web browsing products to gather web data for a wide variety of uses. Some examples from industry include manufacturers tracking the performance ranking of products in the search results of retailer websites, companies monitoring information posted publicly on social media to keep tabs on issues that require customer support, and businesses staying up to date on news stories relevant to their industry across multiple sources. E-commerce businesses use automated web browsing to monitor competitors’ pricing and inventory, and to aggregate information to help manage supply chains. Businesses also use automated web browsers to monitor websites for fraud, perform due diligence checks on their customers and suppliers, and to collect market data to help plan for the future.
These examples are not hypothetical. They come directly from Andrew Fogg, the founder of Import.io, a company that provides software that allows organizations to automatically browse the web, and are based on Import.io’s customers and users. And these examples are not the exception; they are the rule. Gartner recommends that all businesses treat the web as their largest data source and predicts that the ability to compete in the digital economy will depend on the ability to curate and leverage web data. In the words of Gartner VP Doug Laney, “Your company’s biggest database isn’t your . . . internal database. Rather it’s the Web itself.”
Journalists and information aggregators also rely on automated web browsing. The San Francisco Chronicle used automated web browsing to gather data on Airbnb properties in order to assess the impact of Airbnb listings on the San Francisco rental market, and ProPublica used automated web browsing to uncover that Amazon’s pricing algorithm was hiding the best deals from its customers. The Internet Archive’s web crawlers (crawlers are one specialized example of automated web browsing) work to archive as much of the public web as possible for future generations. Indeed, Google’s own web crawlers, which power the search tool most of us rely on every day, are simply web scraping “bots.”
During a recent Ninth Circuit hearing in hiQ v. LinkedIn, LinkedIn tried to analogize the case to United States v. Jones, arguing that hiQ’s use of automated tools to access public information is different “in kind” than manually accessing that same information, just as long-term GPS monitoring of someone’s public movements is different from merely observing someone’s public movements.
And of course LinkedIn doesn’t like hiQ’s automated access; it wants to block a competitor’s ability to meaningfully access the information that its users post publicly online. But just because LinkedIn or any other company doesn’t like automated access, that doesn’t mean it should be a crime.
As law professor Michael J. Madison wrote, resolving the debate about the CFAA’s scope “is linked closely to what sort of Internet society has and what sort of Internet society will get in the future.” If courts allow companies to use the CFAA to block automated access by competitors, it will threaten open access to information for everyone.
Some have argued that scraping is what dooms access to public information, because websites will just place their data behind an authentication gate. But it is naïve to think that LinkedIn would put up barriers to access; LinkedIn wants to continue to allow users to make their profiles public so that a web search for a person continues to return a LinkedIn profile among the top results, so that people continue to care about the maintenance of their personal LinkedIn profiles, so that recruiters will continue to pay for access to LinkedIn recruiter products (e.g., specialized search and messaging), and so that companies will continue to pay to post job advertisements on LinkedIn. The default setting for LinkedIn profiles is public for a reason, and LinkedIn wants to keep it that way. It wants to participate in the open web to drive its business but use the CFAA to suppress competitors and avoid accepting the web’s open access norms.
The public is already losing access to information. With the rise of proprietary algorithms and artificial intelligence, both private companies and governments are making high stakes decisions that impact lives with little to no transparency. In this context, it is imperative that courts not take lightly attempts to use the CFAA to limit access to public information on the web.
 The term “scraping” comes from a time before APIs, when the only way to build interoperability between computer systems was to “read” the information directly from the screen. Engineers used various terms to describe this technique, including “shredding,” “scraping,” and “reading.” Because the technique was largely only discussed in engineering circles, the choice of terminology was never widely debated. As a result, today many people still use the term “scraping,” instead of something more technically descriptive—like “screen reading” or “web reading.”
An earlier version of this article was first published by the Daily Journal on March 27, 2018.