Don't try to protect your site from scraping: resistance is futile

Original author: Gajus Kuizinas
  • Translation
Over the past decade, I have implemented many projects related to content aggregation and analysis. Aggregation often involves extracting data from third-party sites, i.e. scraping, although I try to avoid the term. It has turned into a kind of label that carries many misconceptions. The main misconception is that web scraping can be blocked using X, Y, Z.

tl;dr: it is impossible.

From a business point of view


Last week I met with a senior executive from the industry in which I am building my business, GO2CINEMA. Without a doubt, he is one of the most intelligent and knowledgeable people in the film industry.

The GO2CINEMA business model is based on aggregating showtimes, seat availability and ticket prices from different sources, and on placing ticket purchases on those websites on behalf of the user.

I asked this person for advice about raising investment. He offered his help and asked me to prepare an analysis of all the ways my current business could be blocked, including blocking content scraping (from both a technical and a legal point of view). I prepared the necessary documents and shared them with him before our meeting. His reaction was something like this:

Yes, thorough research. But there are still ways to block you. *grins*

No, man, there are no such ways.

Real users are no different from bots


People outside IT have an idealized picture of programming, like in the computer games of the 80s: you put on a virtual reality helmet and plunge into the Web. In reality, all information and all interactions are zeros and ones. There is nothing human in them. There is no difference between data sent by a computer and data sent by a person.


Inspection of web traffic

Put simply: as long as visitors can access the content on your site, a bot can get the same access. Every technical countermeasure against scraping hinders real users to the same extent as bots.

If you do not believe it, let's look at all the technical ways in which a website can try to block a bot using my business as an example.

Technical countermeasures


Although some of these points will seem silly to a technical reader, investors really did raise all of these concerns, and I had to answer them. So bear with me.

User agent blocking


Each HTTP request contains HTTP headers, including the user agent, which identifies the HTTP client. The cinema could therefore try to identify bots from the information in the HTTP headers and block them.

Solution: spoof the HTTP headers to imitate real users.

Example: GO2CINEMA bots use HTTP headers that mimic real user sessions (for example, in the Google Chrome browser). HTTP headers are randomized between scraping sessions.
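As a rough illustration of header randomization between sessions, here is a minimal Python sketch. The profile list, the header values and the function name `build_session_headers` are assumptions made for this example; the article does not describe the actual GO2CINEMA implementation.

```python
import random

# Hypothetical browser profiles (illustrative values, not taken from the article).
BROWSER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-GB,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    },
    {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Accept-Language": "en-GB,en;q=0.8",
    },
]

# Headers that stay the same regardless of the chosen profile.
COMMON_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
}


def build_session_headers(rng=None):
    """Pick one browser profile and return the headers for a whole scraping session."""
    if rng is None:
        rng = random.Random()
    headers = dict(COMMON_HEADERS)
    headers.update(rng.choice(BROWSER_PROFILES))
    return headers
```

The headers are chosen once per session rather than once per request, because a client whose user agent changes between two consecutive page loads would itself look unusual.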

Conclusion: It is not possible to block GO2CINEMA bots using HTTP metadata sent by the client, such as HTTP headers, without blocking real users.

IP Blocking


The cinema may try to identify and block the IP addresses of the GO2CINEMA bots.

Solution: “fake” IP addresses (using proxies).


Mass Identification

Example:

GO2CINEMA uses a combination of request scheduling and IP rotation to avoid identifiable bot behavior patterns. Here are some of the precautions:

  1. Randomizing IP addresses.
  2. Allocating IP addresses that are geographically as close to the cinema as possible.
  3. Keeping one dedicated IP address for the duration of a scraping session.
  4. Replacing the proxy pool every 24 hours.
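The precautions above can be sketched as a small proxy pool. This is only an illustration under stated assumptions: the class name `ProxyPool`, the `fetch_proxies` callable and the shape of the proxy records are hypothetical, and "every 24 hours" is approximated here by reloading the pool once per calendar day.

```python
import random
from datetime import date


class ProxyPool:
    """Toy proxy pool: one sticky proxy per session, whole pool reloaded daily."""

    def __init__(self, fetch_proxies, rng=None):
        # fetch_proxies is a hypothetical callable returning a list of dicts
        # such as {"url": "http://203.0.113.5:8080", "city": "London"}.
        self._fetch = fetch_proxies
        self._rng = rng if rng is not None else random.Random()
        self._loaded_on = None
        self._proxies = []
        self._sessions = {}

    def _refresh_if_stale(self, today):
        # Reload the pool (and forget old session assignments) once per day.
        if self._loaded_on != today:
            self._proxies = list(self._fetch())
            self._sessions.clear()
            self._loaded_on = today

    def proxy_for(self, session_id, cinema_city, today=None):
        """Return the proxy URL assigned to this scraping session."""
        today = today or date.today()
        self._refresh_if_stale(today)
        if session_id not in self._sessions:
            # Prefer proxies in the same city as the cinema, else use any proxy.
            nearby = [p for p in self._proxies if p["city"] == cinema_city]
            candidates = nearby or self._proxies
            self._sessions[session_id] = self._rng.choice(candidates)
        return self._sessions[session_id]["url"]
```

Calling `proxy_for` repeatedly with the same session id returns the same proxy until the pool is reloaded, which corresponds to keeping one dedicated IP address per scraping session.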

It is worth noting that the current setup has one weakness: the IP addresses (proxies) are registered to various data centers rather than to residential connections like those of real people. In theory, a cinema could obtain a list of the subnets of all UK data centers and block them. That would successfully block the bots in their current configuration. However:

  1. It will be costly. For example, such services are provided by MaxMind (a database of IP addresses of anonymizers, proxies and VPNs; the price is not disclosed) and Blocked ($12,000 per year).
  2. This can lead to the blocking of real users.

Netflix is an example of a provider that blocks the IP addresses of known VPNs and proxies.

If movie theaters start blocking the IP addresses of data centers, we will have to use the IP addresses of home users through residential proxies such as Luminati. This approach has two drawbacks:

  1. Cost (our current traffic would cost £1,000 per month).
  2. Reliability. The performance and speed of residential proxies are difficult to predict.

Some cinemas have already tried to block the IP addresses of our bot. A source told us that Cinema X thinks (or at least thought) that it successfully blocked our IP addresses. But this is not so. The activity of the GO2CINEMA bot did not stop. Cinema X seems to have blocked someone else who was collecting the same data.

It is important to emphasize that, in theory, HTTP requests from people and from bots can be told apart by browsing patterns (see the “Invisible CAPTCHA” section). But identifying the HTTP requests that come from GO2CINEMA bots will be very difficult (for the reasons given in the “User agent blocking” section).

Conclusion: it is extremely difficult to block GO2CINEMA bots with an IP blacklist, because 1) it is extremely difficult to identify the bots and 2) we have access to a large number of data center and residential IP addresses.

IP blocking will not prevent our bots from continuing to scrape movie theater sites.

Using captcha


The cinema can add a CAPTCHA to restrict access to certain sections of the site (for example, the display of occupied seats in the auditorium) or to limit certain actions (for example, completing a payment transaction).

Solution: a CAPTCHA-solving API.


CAPTCHA: the Completely Automated Public Turing test to tell Computers and Humans Apart

CAPTCHA will only add inconvenience for ordinary users. Every CAPTCHA method (including Google's reCAPTCHA) is easily bypassed with third-party services like 2Captcha. In these services, real people solve the tasks handed to our bot. The cost is minimal (for example, £2 per 1,000 tasks).

Conclusion: adding a CAPTCHA will not prevent our bots from continuing to scrape cinema sites.

Invisible CAPTCHA


Cinemas can use mechanisms to identify and block bots based on behavior (the so-called "invisible captcha").

Invisible CAPTCHA combines a number of signals to estimate the likelihood that a particular client's interactions are automated. There is no single recipe for implementing it. Different providers use different parameters to profile users. This service is offered by some CDNs (e.g. Cloudflare) and by traditional CAPTCHA providers like Google reCAPTCHA.
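To make the idea of combining signals into a likelihood concrete, here is a deliberately naive Python sketch. The three signals, the thresholds and the weights are invented for this illustration; real providers do not publish their parameters, and nothing here reflects how Cloudflare or reCAPTCHA actually work.

```python
from statistics import pstdev


def automation_score(request_times, pages_visited, loaded_assets):
    """Return a score between 0 and 1; higher means 'looks more automated'.

    request_times: timestamps (in seconds) of the client's page requests.
    pages_visited: number of pages requested in the session.
    loaded_assets: whether the client also fetched images, CSS or scripts.
    """
    score = 0.0
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]

    # Signal 1: near-constant gaps between requests (people are irregular).
    if len(gaps) >= 2 and pstdev(gaps) < 0.05:
        score += 0.4

    # Signal 2: many pages requested faster than a person could read them.
    if gaps and sum(gaps) / len(gaps) < 0.5 and pages_visited > 10:
        score += 0.3

    # Signal 3: the client never loaded any page assets.
    if not loaded_assets:
        score += 0.3

    return round(score, 2)
```

A bot that adds random delays and drives a real browser passes all three checks, which is why profiling of this kind only raises the bar rather than closing the door.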

According to Shuman Ghosemajumder, formerly head of click-fraud detection at Google, this capability “creates a new type of challenge that very advanced bots can still work around, but that causes far less friction for a real person.”

Conclusion: identification by behavior profile will not prevent our bots from continuing to scrape cinema sites. It is just one more obstacle to work around.

Email Verification


The cinema may require email verification for certain actions, for example, booking.

Solution: disposable mailboxes.


At this point I am adding pictures to the article simply because I like the artist: Kidmograph

Example:

Movie theaters already require an email address to buy a ticket. GO2CINEMA uses the domain name go2cinema.mail to confirm the reservation. A new email address is created for each transaction (for example, john1@go2cinema.mail). Emails sent to the generated mailboxes are not available to GO2CINEMA users.
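Generating a fresh address per transaction takes only a few lines. The sketch below is an assumption about how this could be done: the function name and the random-token naming scheme are invented for the example (the article's own example address, john1@go2cinema.mail, suggests a simpler scheme), and only the domain comes from the text.

```python
import uuid

MAIL_DOMAIN = "go2cinema.mail"  # domain named in the article


def new_transaction_address(first_name, domain=MAIL_DOMAIN):
    """Return a fresh, effectively unique email address for one transaction."""
    # Hypothetical naming scheme: lower-cased first name plus a random token.
    token = uuid.uuid4().hex[:8]
    return f"{first_name.lower()}.{token}@{domain}"
```

If the domain were ever blocked, passing a different `domain` value would be the only change needed.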

The current approach has two advantages:

  1. It limits the ability of movie theaters to track individual user activity.
  2. It prevents movie theaters from sending marketing letters to our users.

The disadvantage is that cinemas can easily identify and block transactions made through our service.

If movie theaters start actively blocking the go2cinema.mail email domain, we can do one of the following:

  1. Buy thousands of cheap domains in bulk.
  2. Use any of the existing services that provide temporary email addresses (for example, Mailinator).
  3. Create temporary mailboxes with existing large providers (Yahoo, Gmail, etc.).
  4. Provide the real users' addresses.

Conclusion: email verification will not prevent our bots from continuing to scrape cinema sites.

Mobile verification


The cinema may require the user to provide a valid mobile phone number to complete the transaction.

Solution: disposable mobile numbers.

If the verification process includes a call or a message (for example, you need to enter a code received on the phone), we can use any virtual phone number provider (for example, Twilio) to issue temporary mobile numbers.

Unlike temporary email addresses, a virtual phone number is relatively expensive (for example, £1 per month per number).

On the other hand, cinemas are unlikely to introduce mandatory mobile verification, because it would hurt their own business for several reasons:

  1. The cost of SMS verification.
  2. Loss of customers who do not have a mobile phone number or who do not want to share it.
  3. A longer, more cumbersome booking flow, which hurts the product (in practice this is a common observation, especially in e-commerce).

It would be extreme and unprecedented for a cinema to go this far.

Conclusion: mobile verification will not prevent our bots from continuing to scrape cinema sites.

BIN blocking


The cinema may block our bank identification number (BIN).

Solution: sue, or issue cards through a regular bank (e.g. Barclays).

Example:

GO2CINEMA uses virtual debit cards to purchase tickets. A virtual MasterCard is issued for each user through the Entropay service. Entropay works like a bank, that is, all of its cards start with the BIN 522093. In theory, a movie theater could block this BIN.
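For readers unfamiliar with BINs, a block of this kind is just a prefix check on the card number. The sketch below is a hypothetical illustration of what a merchant-side filter could look like, not code any cinema is known to run. It assumes the BIN is the first six digits of the card number, as in the Entropay example above.

```python
def bin_of(card_number):
    """Return the first six digits of a card number, ignoring spaces and dashes."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return digits[:6]


def is_blocked(card_number, blocked_bins):
    """Return True if the card's BIN is in the merchant's block list."""
    return bin_of(card_number) in blocked_bins
```

For example, `is_blocked("5220 9300 0000 0000", {"522093"})` returns `True`, and the same check would reject every other card from that issuer as well.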

But blocking the BIN would violate the cinema's contract with its payment gateway. Every such agreement includes a rule requiring acceptance of all cards. In the case of MasterCard, this policy is set out in clause 5.10.1 of the rules:

5.10.1 Acceptance of all Cards
The merchant must accept all valid Cards without exception or preference when they are properly presented for payment. The merchant must treat all customers who wish to pay with a Card equally.

For the same technical reason, movie theaters cannot block MoviePass in the USA:

“We fully comply with MasterCard rules, and AMC has signed agreements with both their processing company and MasterCard. To block us, they essentially have to refuse to accept MasterCard.” - source

Please note that the rules on accepting all cards differ between Europe and the USA. In Europe, a merchant is allowed to refuse an entire type of card, for example, all prepaid cards.

Conclusion: BIN blocking will not prevent our bots from continuing to scrape cinema sites.

Changing the site structure


The cinema can change the structure of its site without warning.

This concern rests on a number of guesses about how our scrapers work. In most cases, modern scraping methods do not depend on the site structure. More on this in the next article.

If we assume that our scraper depends on the structure of the website, then:

  • Changes to a site's structure are rare.
  • This is not much different from API changes.
  • Our systems will notify us as soon as it happens.
  • It will also affect real users.
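The notification mentioned in the list above can be as simple as validating every scrape result and raising an alert when too many records come back incomplete. The sketch below is one possible approach under stated assumptions: the field names, the 20% threshold and the `alert` callback are hypothetical and not taken from the GO2CINEMA systems.

```python
REQUIRED_FIELDS = ("film", "start_time", "price")  # hypothetical showtime schema
MAX_BROKEN_RATIO = 0.2  # assumed tolerance before an alert is raised


def missing_fields(record):
    """Return the required fields that are absent or empty in a scraped record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def check_scrape(records, alert):
    """Call alert(message) if the scrape looks broken; return True if healthy."""
    if not records:
        alert("scrape returned no showtimes")
        return False
    broken = [r for r in records if missing_fields(r)]
    if len(broken) / len(records) > MAX_BROKEN_RATIO:
        alert(f"{len(broken)} of {len(records)} records are missing fields")
        return False
    return True
```

In practice `alert` could post to a chat channel or page an engineer. Here it is just any callable that accepts a message string.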

Conclusion: changing the structure of the site will not prevent our bots from continuing to scrape cinema sites.

Protecting an API with an API Key


Another businessman (the owner of the cinema) suggested that the cinema could block GO2CINEMA by restricting access to the API using the API key.

He: Cinema can simply upgrade its API to require an API key.
Me: What API?
He: API for access to film screenings.
Me: Is this information published on the site?
He: Yes.
Me: Then the browser client must have access to this API to view content on the site.

On this particular site, which he cited as an example, the API key was hardcoded in the source code.
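To show why a key shipped to the browser protects nothing, here is a minimal Python sketch that pulls such a key out of a page's source. The `apiKey` property name and the sample markup are assumptions for illustration. The article does not say how the key was embedded on that site.

```python
import re

# Matches patterns such as  apiKey: "abc123"  or  "apiKey" = 'abc123'
# (a hypothetical naming convention, chosen for this example).
API_KEY_PATTERN = re.compile(r"""apiKey["']?\s*[:=]\s*["']([A-Za-z0-9_\-]+)["']""")


def find_api_key(page_source):
    """Return the first API key embedded in the page source, or None."""
    match = API_KEY_PATTERN.search(page_source)
    return match.group(1) if match else None
```

Anything the site's own JavaScript can read, a scraper can read too, so the key authenticates the website rather than the visitor.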

Conclusion: restricting access to the API with an API key is not an effective way to limit scraping if the public website itself uses that API.

Summary


Currently, there is no technical means of blocking a particular bot's access to the content of a movie theater website. All of these mechanisms act only as a deterrent.

It is worth emphasizing that although none of these methods blocks bots, introducing some or all of them will be costly for the cinema in terms of 1) technical development and 2) inconvenience to real users.

Blocking 98% of scrapers


Although you cannot block all scrapers, you can deter or block most of them using a combination of the methods above.

Is it worth your effort? The answer depends on the following factors:

  • What impact do scraping bots have on your website / business?
  • Will the countermeasures affect real users?

Most often, the answer is that it's not worth it.

Legal obstacles


The site owner cannot block bots by technical means. But does the owner have legal tools for this?

Short answer: no (or it is extremely unlikely, difficult, and would take many years).

In the next article, I will discuss more than a dozen legal cases that I used to assess the legal climate in Europe for a scraping business.

Concluding observations


It seems that many readers have decided that we started a technological war with movie theaters (see the Reddit comments). This is not true.

Apart from the rumor that Cinema X attempted to block our IP addresses (mentioned in the IP Blocking section), not a single cinema has tried to block us using any of these methods. The purpose of this article is to share a brief description of the what-if scenarios I prepared as a contingency plan while raising venture financing.

Most cinemas remain dinosaurs, which still use fax machines for everyday communications and Excel spreadsheets to manage schedules. They cannot afford or do not see the need to have an API - and they are fully aware and glad that other companies are copying session schedules from their website.

If you are going to scrape other people's content, first ask the site owner whether they have an API. This will save time and money for both you and the site owner.
