It is a data-driven world. Sourcing and consuming external data is a necessity for many businesses, and for some, leveraging publicly available data is the only way to survive and undercut the competition. While web scraping is the key to unlocking access to web data, there is a lot of confusion, and many myths, around its legality and ethics. This article aims to address those and bring clarity to the topic. It also covers the best practices you should follow, as well as the legal and ethical boundaries you should respect, to get the most out of web scraping while keeping it safe and legal.
Web scraping is a great way to source useful external data for data-driven businesses around the globe. However, there is a lot of confusion about its legality. If you type the question “Is web scraping legal?” into Google, you’ll find opposing views, depending on who is answering. While data scraping companies will try to paint an optimistic picture to get more business, anti-scraping service providers will equate it with data theft to sell their solutions.
The truth is that almost all big companies use web scraping, one way or the other, to collect data about their competitors and markets. They do not see it as unethical for their own use. However, it may irk them when they find others scraping their own websites.
In this post, I will try to take an unbiased view. Things are not always black and white, though, and some situations are open to interpretation, so I recommend seeking legal advice when in doubt. This article is not intended as legal advice.
Before we look into whether web scraping is legal or illegal, let’s understand what it is.
The same operation could be performed by humans, but it would be much slower. An example is downloading product attributes for a few thousand items on Amazon.
So, is it legal to scrape a website, then?
There is no law in the US, or elsewhere, that says web scraping is illegal. So does that mean web scraping is legal? It depends on what data you are scraping and how you are using it.
Web scraping is simply a tool to automate what humans can otherwise do manually. A tool itself cannot be legal or illegal. It’s the use of the tool that can be legal or illegal.
Data scraping has been in use for a long time. Search engines use bots to discover and index web pages. Price comparison websites use scraping to inform their consumers before they make purchases. You could even scrape your own website for analytics. At the same time, bad actors may use scraping to conduct fraudulent activities such as data theft or DDoS attacks.
Though web scraping is not illegal, it’s a technology you should use with care. There are boundaries you should respect to make sure you don’t get into legal trouble. If you scrape smartly and follow ethical web scraping practices, it’s highly unlikely to be held against you, even if the websites you are scraping do not like it.
It comes down to three things that decide legality:
The following section will help you determine whether your web scraping use case lies in the safe zone or not.
Asking yourself the following six questions, which reflect generally accepted web scraping ethics, will help you stay compliant.
Personal data scraping could be a potentially unsafe area where you need to be extra cautious. Different jurisdictions have different laws governing access and use of personal data. While it might be okay to scrape personal data in some US states, you may get into trouble for doing the same in California. Wherever you are, check your local regulations before you scrape personal data.
Territorial reach matters too. Even if you are located somewhere that permits scraping personal data, the laws of another jurisdiction may apply to you if you scrape the data of a person located there, such as someone in the EU. The EU is very particular about the privacy of its residents, so you may want to review the General Data Protection Regulation (GDPR) before scraping their information.
Next, you may ask, what is personal data?
According to the California Consumer Privacy Act (CCPA), personal information is the data that can identify or be linked to an individual or household. It includes, but is not limited to, a person’s name, birthday, contact details, IP address, and audio and video recordings.
On the bright side, you won’t typically need to worry about personal data when scraping for price intelligence or competitive analysis.
However, when scraping reviews and social media data, personal data is often a consideration. Usernames, names, and profile pictures, among other things, can count as personal data in this case. In such scenarios, there are multiple ways to avoid legal issues. For example, you can anonymize the data by omitting fields such as usernames and email addresses.
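As a rough illustration of omitting personal fields, here is a minimal Python sketch. The field names and the sample review are made up for this example; your scraper's records will look different, and free-text fields such as the review body can still contain personal details that this simple filter does not catch.

```python
from typing import Any

# Hypothetical field names; adjust to whatever your scraper actually collects.
PERSONAL_FIELDS = {"username", "name", "email", "profile_picture"}


def anonymize(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the record with the personal fields left out."""
    return {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}


# A made-up scraped review, used only to show the effect.
review = {
    "username": "jane_d_92",
    "email": "jane@example.com",
    "rating": 4,
    "text": "Solid product, fast shipping.",
}
cleaned = anonymize(review)  # keeps only "rating" and "text"
```

Dropping the fields before the data is stored, rather than afterwards, means the personal data never lands in your dataset in the first place.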
When you’re working with CrawlNow, we carefully review your specific use case and work hand in hand with you to make sure you comply with laws related to personal data, including GDPR, CCPA, and your local jurisdictions.
Before scraping a website, you should know what is public data and what is not. Websites generally keep certain data available to the public. As long as you are scraping only the publicly available content, you should generally be safe. However, there are a few other things to keep in mind that are discussed in the following sections.
Non-public data is data that is not accessible to everyone on the web. You will typically need to log in to view it. If the data is only available after you have logged in, it is not available for public access. If you scrape non-public content, you may be inviting trouble, but it depends on the context.
Facebook, for instance, may allow you to scrape data under certain conditions, but only with “Facebook’s express written permission”.
A lot of the content available on the internet is protected by some kind of copyright. Scraping and using copyrighted material irresponsibly may amount to copyright infringement. Music, news, blogs, research papers, movies, images, databases, and logos are some examples of potentially copyrightable content. Even without an explicit copyright notice, an original work is automatically protected for its author in countries that follow the Berne Convention.
However, not all information on the internet is covered by copyright. Some of it is plain facts, and consequently a safe resource for web scrapers. Product names, product descriptions, price data, and the number of sales or views, which are the core inputs of price intelligence and competitive analysis, are some examples of plain facts.
Images, videos, and databases are some of the content types that may come up in web scraping projects. In such cases, it’s important to look at the use case, since you may be able to scrape copyrighted data in certain situations, depending on how you use it.
Aggregators, for example, typically use snippets from different sources and attach a link that directs the viewer to the original source, i.e. the copyright holder. In many situations, you may want to scrape copyrighted data for analysis. In many jurisdictions, uses like these may be considered ethical web scraping. However, scraping copyrighted data and publishing it as your own is undoubtedly illegal.
Web scrapers are preferred over manual data extraction because they can fetch you data in mere seconds. Though web scrapers are efficient tools, you should not hit a website’s server with too many requests in a small interval.
Scraping a website aggressively can overload its server and may even crash it if the website has no rate limiting in place. In that case, you damage the website’s functionality and may be held liable under “Trespass to Chattels” (more on this later).
Some websites specify a “crawl-delay” directive in their robots.txt file (more on this later, also). “Crawl-delay: 10” means that a bot should wait at least 10 seconds between two consecutive requests.
If the website doesn’t specify a crawl-delay directive, 1 request per 10 to 15 seconds is a reasonable crawl rate in most scenarios. As long as you stay within a reasonable crawl rate, you are unlikely to run into legal issues on this front.
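One way to enforce such a delay in a scraper is a small throttle object that sleeps until enough time has passed since the previous request. The sketch below is only an illustration: the class name `Throttle` and its parameters are invented for this example, and the 10-second default simply mirrors the figure mentioned above.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(self, delay_seconds: float = 10.0, clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep until the delay has elapsed; return the number of seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            remaining = self.delay_seconds - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept


# Intended usage (commented out so the sketch does not actually sleep):
# throttle = Throttle(delay_seconds=10)
# for url in urls:
#     throttle.wait()   # blocks until at least 10 seconds since the last request
#     ...               # fetch the page here
```

Keeping one throttle per website, rather than one global throttle, lets a crawler stay polite to each site without slowing down unrelated ones.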
Agreements can be either browsewrap or clickwrap. Browsewrap agreements are considered concluded simply by visiting the website. However, in many cases, they appear inconspicuously at the bottom of the screen or within a drop-down menu. In such cases, they are generally not legally binding. If the agreement appears as a pop-up window or the website links to the ToS in a prominent position, however, it may be enforceable. Looking at a summary of related court cases will help you better understand the legal theory behind browsewrap agreements.
In contrast, clickwrap agreements are those that require the user to tick a checkbox or click a button. Near the button or checkbox, you will find something along the lines of “By clicking, you agree to our Terms and Conditions”. After you take the required action, the Terms and Conditions are legally binding on you and a court may enforce them.
If you want to use web scraping tools, you should know about robots.txt. Consider it as an instruction manual that the website places for bots.
The “Disallow” directive tells robots which paths the website owner does not want them to visit; “Disallow: /” covers the entire site. The minimum delay between successive requests may also be specified with the “crawl-delay” directive.
It is generally a good idea to check a website’s robots.txt file before scraping it and to respect the directives laid down in it.
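Python’s standard library can read these directives for you. The sketch below parses a made-up robots.txt with `urllib.robotparser`; the file content, the example.com URLs, and the bot name `example-bot` are all assumptions for illustration. In a real scraper you would load the site’s actual file (for instance with `set_url()` and `read()`) instead of a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

USER_AGENT = "example-bot"  # hypothetical bot name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether this bot may fetch a given URL under the rules above.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/products/42")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/report")

# Crawl-delay in seconds, or None if the file does not specify one.
delay = parser.crawl_delay(USER_AGENT)
```

With this sample file, the product page is allowed, the page under /private/ is not, and the requested delay is 10 seconds.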
Let’s look at some important laws governing web scraping and some high-profile judgments that are shaping the present and future of the data collection world.
The HiQ vs. LinkedIn case has emerged as a landmark for web scrapers. LinkedIn entered into a dispute with a small data analytics company, HiQ Labs, by sending a cease-and-desist letter demanding that it stop all scraping activities on LinkedIn. The letter also stated that LinkedIn had blocked HiQ Labs from accessing public profiles.
Did HiQ back out?
No. HiQ Labs took the case to court, arguing that scraping publicly available data is not illegal and that blocking it gives big companies like LinkedIn the unfair advantage of hoarding public information.
In September 2019, the US Ninth Circuit ruled in favor of HiQ, stating that collecting publicly available data was not a violation of the CFAA. In June 2021, the Supreme Court granted LinkedIn’s petition for a writ of certiorari and sent the case back to the Ninth Circuit for further consideration. Though the case is still pending, a decision in favor of HiQ could mean a groundbreaking victory for ethical web scraping.
“Facebook vs. Power Ventures” is another well-known dispute in the web scraping community. It began in 2009 when Facebook took legal action against Power Ventures for extracting Facebook’s user information and displaying it on its own website. Facebook alleged violations of the CAN-SPAM Act, the CFAA, the DMCA, and the UCL, as well as copyright infringement.
What happened next?
Though the court dismissed the other claims, three claims, under the CAN-SPAM Act, the CFAA, and the California Penal Code, went forward to a final decision. In the end, the decision went in favor of Facebook and the court ordered Power Ventures to pay Facebook $79,640.50.
Comparing the two cases, “HiQ vs. LinkedIn” and “Facebook vs. Power Ventures”, makes it easier to understand where data scraping may or may not be legal. Facebook controls access to its data by requiring a login and password. When you scrape its user profiles, you scrape behind the login. Is data scraping legal in this case? Power Ventures was sued for it, so probably not.
In contrast, LinkedIn’s public profiles are accessible directly through the browser. You don’t need to log in to view these profiles. Is scraping legal here? Judging by how the case is turning out in court, there’s a good chance it could be.
The Computer Fraud and Abuse Act (CFAA) is another important law that might be relevant when considering the legality of your scraping activity. The act says that intentionally accessing a computer system without authorization, or in excess of authorization, may be subject to legal action.
So what does that mean for web scrapers?
Though the HiQ vs. LinkedIn case has been sent back to the Ninth Circuit for reconsideration, the court’s preliminary decision suggests that when a server’s data is publicly available, accessing it may not be a violation of the CFAA. But we’ll have to wait for the final decision on the case to know for sure.
Regardless of how the ruling on the HiQ vs. LinkedIn case turns out, the CFAA may still apply to web scraping in cases where non-public data is involved. Websites that keep certain information behind a login may hold you liable under the CFAA for scraping it.
Everyone knows that trespassing on someone’s property is illegal. Digital trespass can be unlawful too. A website is the property of its owner, and Trespass to Chattels is a legal doctrine that has been applied to the wrongful use of someone’s digital property.
When you visit a website, which is the digital property of its owner, you should behave responsibly. If irresponsible behavior on a website damages its condition, quality, or value, you may be held liable under Trespass to Chattels. For instance, if a high crawl rate crashes the website’s server, the website’s owner may file a lawsuit under “Trespass to Chattels”.
That being said, as long as you scrape a website responsibly, and make sure no damage is inflicted in any way, you wouldn’t have to worry about violating Trespass To Chattels.
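Fair Use is a legal doctrine in the United States that permits scraping and use of copyrighted content in certain situations. Under this doctrine, certain uses of copyrighted material, including criticism, research, teaching, and news reporting, may be considered “fair use”.
However, four factors govern whether a use case falls under fair use: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the original work.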
So what does it come down to? Is web scraping legal or not? We firmly believe it is. It is nothing more than the automation of work that humans would otherwise do by hand.
You just have to respect certain legal boundaries and best practices. Respect robots.txt, don’t swamp the website with unreasonably high crawl rates, and be extra cautious with copyrightable content and personal data. Seek professional legal advice whenever in doubt.
Generally, partnering with a professional web scraping service makes it easier to follow these principles.
When conducted in a responsible manner, web scraping is a powerful technology for gathering information, and even creating new information, on the internet. From content aggregation and competitive research to creating datasets for training machine learning models, the use cases for web scraping are endless.
Speak to a CrawlNow data expert today to explore new opportunities for using data to fuel growth for your business.