Scraping Together Best (Scraping) Practices: Is There an API for That?

Despite its evocative name, “web scraping” is just a blitz of signals zipping across fiber at close to the speed of light. Not so easy to envision. I find, however, that many digital phenomena like this one are best demystified by conjuring an analogue:

Last Sunday, those who braved New York City streets to run the Marathon careened down Central Park West, visibly fatigued as they neared the end of the race. The runners, less densely packed by 75th Street, stopped briefly at the water stand on the corner to hydrate and grab an orange slice before continuing on the last leg of their 26.2-mile journey. Their conduct, civil and orderly, would likely have been frustrated if an industrious entrepreneur had decided to capitalize on this free-for-all by, say, programming a robot that could approach the stand 1,000 times as fast as any runner, bottle the available water, and proceed to sell it on the very next block.

Such is the conduct at issue in the long-awaited decision, hiQ Labs, Inc. v. LinkedIn Corp. In this case, on appeal before the 9th Circuit, LinkedIn disputed the district court’s grant of a preliminary injunction to hiQ, which prohibited LinkedIn from continuing to block hiQ’s efforts to scrape the site. hiQ, a data analytics company whose core product model depends on scraping public LinkedIn profiles, argued that because the LinkedIn pages are open for all to use, it is unnecessary to acquire authorization to scrape their data. LinkedIn claimed, in opposition, that hiQ’s actions violated a number of laws, most notably the Computer Fraud and Abuse Act (CFAA), which makes it a crime to “access a computer without authorization or exceed authorized access.” LinkedIn further alleged that hiQ violated LinkedIn’s terms of use. The company claimed that “[A]uthorization from LinkedIn—the server’s owner—is ‘needed’ to avoid CFAA liability, regardless of whether those servers also host data that LinkedIn generally makes available on its website. hiQ lacked that required ‘authorization’ once LinkedIn sent hiQ its cease-and-desist letter and implemented additional technological barriers restricting bot access.”

Dismissing LinkedIn’s concerns, the 9th Circuit affirmed that the automated scraping of publicly accessible data likely does not violate the CFAA. In the opinion, Judge Berzon, quoting the district court, suggested that LinkedIn users’ privacy expectations regarding information they have shared in their public profiles are “uncertain at best” and are mere pretext for LinkedIn’s anticompetitive agenda. LinkedIn issued its cease-and-desist after allowing hiQ’s scraping for years, only forbidding the activity once it decided to produce its own product similar to hiQ’s service. In the court’s analysis, the “possible creation of information monopolies that would disserve the public interest” and “hiQ’s interest in continuing its business” outweighed users’ privacy interests in their public profile information.

While the opinion does leave open the possibility that other laws, such as state trespass to chattels claims, could potentially block scraping activity, the decision may still wind up before the Supreme Court. Regardless of the decision’s ultimate fate, it raises a number of issues that have potentially far-reaching implications for the use of automated scraping, splitting the opinion of Internet advocacy groups.

Reaction of the Internet Advocacy Community 

Proponents of the 9th Circuit’s decision, such as the Electronic Frontier Foundation (EFF), celebrate Judge Berzon’s ruling as a victory for the open Internet, arguing that it helps solidify the right to public information. After the lower court’s ruling, the EFF, in conjunction with the privacy-focused search engine DuckDuckGo and the Internet Archive, filed an amicus brief lauding the beneficial uses of scraping. Web scraping has been effective in exhuming corporate skeletons, such as racial discrimination on Airbnb, which companies may prefer to suppress. In contrast, critics of the holding, like the Electronic Privacy Information Center (EPIC), filed an amicus brief arguing that “the lower court has undermined the fiduciary relationship between LinkedIn and its users” and that the order is “contrary to the interests of individual LinkedIn users” and “public interest…because it undermines the principles of modern privacy and data protection law.” Indeed, the decision is somewhat surprising given the 9th Circuit’s posture as a bastion of consumer privacy and the relevant case law on the books. In both U.S. v. Nosal and Facebook v. Power Ventures, the Circuit supported the use of the CFAA to prevent password-sharing.

Concerns About hiQ Labs Inc. v. LinkedIn Corp.

While a free and open Internet is a worthy objective, private business interests shouldn’t be dismissed out of hand. Corporations are also citizens of the web. Despite the potentially suspect purpose prompting LinkedIn’s conduct in this case, the imperative to restrict or control scraping does not necessarily reflect a corporation’s underlying anticompetitive agenda.

As the analogy of the bootstrapping, marathon-exploiting robot suggests, the 9th Circuit’s ruling makes little sense in the physical world. The EFF has noted that human browsing activity is materially the same as “web scraping,” which “is simply machine-automated web browsing, and accesses and records the same information, which a human visitor to the site might do manually.” However, this comparison is unsatisfying. A human using a web browser accesses the site very differently from bots designed to programmatically collect information at a rate orders of magnitude beyond what any human could achieve. By contrast, APIs, as evidenced by LinkedIn’s own outward-facing documentation, provide an opportunity for third parties to retrieve structured data in cooperation with the corporation. It’s true that the corporation, controlling both the issuance of access tokens and the responses, could selectively block requests or doctor returned data, but this concern should be considered alongside the perspective that scraping itself threatens user security and protection of data.

Permitting unfettered scraping of public data also threatens to privilege large companies that can marshal the engineering and hardware resources necessary to scale with voluminous bot traffic. Small companies may be ill-equipped to serve such surges, at best resulting in a degraded experience for their human customers – slow response times, for instance, when milliseconds matter – and at worst inflicting unintentional denial-of-service (DoS) attacks. Simply maintaining site stability may exhaust the bulk of an engineering team’s bandwidth, preventing innovation that could be vital to the company’s continued viability. Moreover, sustained bot traffic may materially deplete the budgets of small online companies, which must pay for the computational power and storage required to handle scraping.

Scraping also undermines many of the principal metrics used for financial reporting. When corporate success is tied to web traffic, the corporation’s inability to confidently distinguish between bot and human activity may prevent the reliable tabulation of site visits. This, in turn, influences stock price, corporate planning, and materials issued to potential investors or hires, skewing the perceived health of the site. Aside from functioning as a costly tax on a site’s operating budget, sanctioning scraping invites a host of malfeasance. For instance, a bad actor might emulate the user agent of a known competitor in order to spoof said company. The impersonator could then send an untenable number of requests in order to sully the reputation of its competitor. Alternatively, passing as said competitor, the impersonator could potentially evade the target site’s filters. Neither of these misbehaviors is plausible when requests are made to an authenticating API.
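For readers who want to see why the user agent is so easy to spoof, the following is a minimal Python sketch, not a description of LinkedIn’s or any real site’s implementation. It assumes a hypothetical server that signs each client ID with a secret key (an HMAC scheme chosen here purely for illustration); every name in it (SERVER_SECRET, issue_token, identify_by_user_agent, identify_by_token, “CompetitorBot”) is invented for the example. The point is only that a User-Agent header is whatever the sender says it is, while a signed token can be checked by the server that issued it.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical server-side secret (assumption for illustration only;
# a real service would load this from secure storage, never hard-code it).
SERVER_SECRET = b"example-secret"


def _sign(client_id: str) -> str:
    """Compute an HMAC-SHA256 signature that only the secret holder can produce."""
    return hmac.new(SERVER_SECRET, client_id.encode(), hashlib.sha256).hexdigest()


def issue_token(client_id: str) -> str:
    """Issue an API token of the form '<client_id>.<signature>'."""
    return f"{client_id}.{_sign(client_id)}"


def identify_by_user_agent(headers: dict) -> str:
    """Scraping case: the site can only take the self-declared header at face value."""
    return headers.get("User-Agent", "unknown")


def identify_by_token(headers: dict) -> Optional[str]:
    """API case: return the client ID only if the token's signature verifies."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return None
    token = auth[len("Bearer "):]
    client_id, _, signature = token.rpartition(".")
    if not client_id:
        return None
    # Constant-time comparison of the presented and expected signatures.
    if hmac.compare_digest(signature.encode(), _sign(client_id).encode()):
        return client_id
    return None


# An impersonator can claim any user agent it likes...
spoofed = {"User-Agent": "CompetitorBot/1.0"}
print(identify_by_user_agent(spoofed))

# ...but cannot forge a token without the server's secret.
forged = {"Authorization": "Bearer competitor.deadbeef"}
print(identify_by_token(forged))

genuine = {"Authorization": "Bearer " + issue_token("competitor")}
print(identify_by_token(genuine))
```

In this sketch the spoofed user agent is accepted as given, the forged token is rejected, and only the token actually issued by the server resolves to the competitor’s identity. Real APIs typically layer expiry, revocation, and transport security on top of something like this.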

Conclusion and Thoughts for the Future

While scraping can undoubtedly be used to pursue truth, it also threatens to undermine stability and security, affecting business interests, investor willingness to participate, and the health of the corporation more generally. The difficulty of reconciling these two positions in this case stems in part from forcing the CFAA, an old instrument, into the context of automated web scraping as the lifeblood of entire corporations. Imposing CFAA liability could, as the EFF suggests, “potentially criminalize all automated ‘scraping’ tools—including a wide range of valuable tools and services that Internet users, journalists, and researchers around the world rely on every day.” This tension emphasizes the importance of judicial decisions and new regulatory regimes that reflect the nuance required by developing technologies. As policymakers undertake efforts to draft state and potentially federal privacy legislation, striking the right balance between the interests of all stakeholders will be key.

 
