Blog

What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, uses screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can cause difficulty for the servers to handle it. This can cause degraded performance in the website. Malicious hackers use this tactic in what’s known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.