Parsing websites - and is it even legal in Russia?

    By one definition, parsing is simply the automated extraction of information. To someone not involved in the day-to-day work of collecting and processing data for Internet projects, that phrase means little, and it only hints at the enormous amount of work that hundreds of millions of people and tens of millions of robots (virtual, but no less real) do around the world every minute. For a person, though, the task itself is perfectly ordinary: comparing ticket prices online, picking the right electronics across several shop sites, watching prices and promotions in the mobile app of the nearest supermarket. Doing any of that, none of us would ever think to call ourselves a parser.

    Nevertheless, parsing as a business exists, works, and is naturally the subject of lively debate on many levels: ethical, legal, technological, financial and beyond.

    This article does not push a definite opinion, give advice or reveal secrets. Here we simply review some of the views expressed in the comments on one particular article about parsing on Habr (50k views and more than 400 comments!), looking at them from the perspective of our own experience in parsing web projects. In other words, we spent a lot of time gathering and classifying the most interesting reader comments... worldly wisdom, so to speak :)

    So, about parsing:

    "A matter of technology." Fantastic proxies and where they live.


    The idea of parsing is as natural as the basic methods of doing it are simple (it is always interesting to see what the "neighbours" are up to). If you want to know something, you ask; but if you want to know the current values of a large data array (prices of goods, their descriptions, quantities available for ordering, hot discounts), you will have to "ask" a lot and often. Obviously nobody would try to collect this data by hand (short of hiring a large team of tireless helpers motivated in not the most humane way), so simple, head-on solutions are used: hit the site, configure a browser, assemble bots, "knock" on the target site for the indicators of interest, carefully write the answers down in a "notebook" in a convenient format, analyse the collected data, and repeat.
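    As a rough illustration of that loop - ask, write the answer down, wait, repeat - here is a minimal Python sketch using requests and BeautifulSoup. The product URLs, CSS selectors and output file are hypothetical placeholders, not taken from the article.

```python
# Minimal sketch of a polite price-collection loop (assumed URLs and selectors).
import csv
import time

import requests
from bs4 import BeautifulSoup

PRODUCT_URLS = [
    "https://example.com/catalog/item-1",   # placeholder targets
    "https://example.com/catalog/item-2",
]

with open("prices.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "price"])
    for url in PRODUCT_URLS:
        resp = requests.get(url, headers={"User-Agent": "price-monitor/0.1"}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        title = soup.select_one("h1")        # hypothetical selector
        price = soup.select_one(".price")    # hypothetical selector
        writer.writerow([
            url,
            title.get_text(strip=True) if title else "",
            price.get_text(strip=True) if price else "",
        ])
        time.sleep(2)  # pause between requests so the target site is not "piled up"
```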

    Here are some approaches to the "parsing technique" from our readers and from us:

    1. "A Selenium farm - and off you go!" (meaning headless browsers plus a BeautifulSoup-like toolkit, e.g. Selenium / Splinter; a minimal sketch of the headless approach follows this list). One reader wrote a small site on a Docker Swarm cluster for his wife, an importer, to monitor her sellers' sites and make sure they do not violate the RRP/MRP (recommended/minimum retail price) policy. According to the author, everything runs stably and the economics of parsing work out - "the whole cost is 4 nodes for $3". True, the proud author only has about a thousand products and a few dozen sites in parsing, no more :)
    2. "We launch Chromium and everything is OK - it turns out you can take about 1 product every 4-5 seconds...". Of course, no admin will be happy about a sudden jump in server load. The site exists precisely to provide information to anyone interested, but "there are many of you and only one of me", so the most persistently curious simply get ignored. No matter: Chromium comes to the rescue - if it is a real browser knocking on the site, the way the site itself expects to be asked, it gets served without waiting in line. In the overall mass of parsing tasks, plain html pages are parsed in 90% of cases; in the "especially hard cases" (when sites actively defend themselves, like Yandex.Market asking for a captcha), it is Chromium that copes.
    3. "Clean proxies with your own hands from LTE routers / modems." There are quite working ways to configure clean proxies suitable for parsing search engines: a 3G / 4G modem farm or buying white proxies instead of a bunch of random dirty proxies. It is important what programming language is used for such industrial parsing - 300 sites per day (and the correct answer is .Net! :). In fact, the Internet is full of sites with open proxy lists, 50% of which are quite working, and it’s not so difficult to parse proxy lists from these sites, then to parse other sites with their help :)) Well, we do it.
    4. Another case in favour of Selenium: "I do parsing myself (not in RuNet though - I pick up orders on my beloved upwork.com, where it is usually called scraping, a more fitting term, IMHO). My ratio is a bit different, roughly 75 to 25. But on the whole, yes: whenever something is tedious or difficult, nobody has yet managed to dodge Selenium :) Out of the several hundred sites I have worked with, it has never come to image recognition to get the target data. Usually, if the data is not in the html, it is pulled in via some JSON (as in the example already shown)."
    5. "Python Tamers." And another reader’s case: “In my previous work I used Python / Scrapy / Splash for 180+ sites a day of different sizes from prisma.fi and verkkokauppa.com to some little thing with 3-5 products. At the end of last year, we rented such a server from Hetzner (https://www.hetzner.com/dedicated-rootserver/ax60-ssd) with Ubuntu Server on board. Most of the computing resources are still idle.
    6. "WebDriver is our everything." Engaged in general automation (where parsing already falls), as reliable as possible (QA tasks). A good workstation, a dozen or two browsers in parallel - the output is a very evil, fast thresher.

    The "gentleman's set" of a hovering one - 4 virtual machines, unlimited traffic, 4 processors on each, 8 GB of memory, Windows Server ... So far, enough for each new batch of conditionally 50 sites - you need your own virtual machine. But it depends a lot on the sites themselves. Visual Studio also has System.Net, which actually uses the Internet Explorer installed in Windows. It works too.

    "How do you protect yourself from parsing properly? No way - we will parse you anyway"


    Since parsing is our business, business ideas around it are constantly being thrown at us:

    1. Parsing Yandex search results, as many SEO services do. "There is more demand and more money in it. True, they mostly sell a whole SEO analytics system." We do not parse search results ourselves - nobody asked, a captcha appears right after about 100 requests, you need clean proxies, and those are hard or expensive to get, so it is not that profitable... The big players are genuinely hard to hold on to, and our readers confirm it (we ourselves do NOT parse Google and Yandex). In their experience, Yandex, Google and similar large corporations keep a database of data-centre subnets (after all, public proxy databases are regularly updated, and the major players subscribe to them and ban everything in them). So a proxy network raised on IP addresses issued to data centres flies straight into a ban, complete with captchas and other quirks. What remains are only illegal options - buying proxies from botnet owners and similar "dirt", in which case you get real users' IPs. And even then, such corporations really want you to have "aged" cookies with which you have already "crawled" for a while around sites where they can track you (hit counters, for example). Then again, how do they tell parsers apart from NAT in a residential district? A hundred-odd requests is nothing at all.
    2. Protection from parsing. Leaving the "great and terrible" aside, let us focus on us "mere mortals". If there are people who parse, there must be people who try to stop them. Playing against live opponents is more interesting: an element of rivalry appears, and each side tries to outwit the other. And since nobody intends to collect information by hand anyway, the game becomes who can make a bot most resemble a live person, and who can recognise those bots most efficiently while still answering requests from real users - the site, after all, exists to help the business, and that is our starting point. Staying within the framework of business efficiency, one cannot ignore the sensible allocation of resources and the cost-effectiveness of measures both for parsing and for countering it:

      • You cannot protect yourself from parsing (except from "students"), but you can raise its cost threshold (in both time and money). As a result, for the data we protect (a few sections of the site) it becomes easier not to parse but to go and buy a ready-made database, just as we buy one ourselves. Tables of parser IP addresses are lying around on the net, and showing a captcha to everyone on that list at the door is not a problem. Likewise, generating ids and classes dynamically, as mail.ru does, is not a problem either and does not require any great expense. The new captcha from Google determines quite accurately whether you are a robot or not, and if there is any suspicion, cutting the user off and asking for a captcha is simple. Finally, nobody has cancelled the HoneyPot bait for catching bots. And the classics: swap letters in the text, apply masks, and so on (a toy sketch of two of these tricks follows this list).
      • And here we will object to ourselves. The defender's hope is that, while none of this helps individually, all of it together will complicate the parser's life enough to make the job not worth it - especially since these techniques cost the defender very little. The trouble is that getting around them is not expensive either, so in essence there is no protection: dynamic proxies, captcha-recognition services, and Selenium with a well-thought-out action algorithm. All you achieve is that developing the parser costs more; that might scare somebody off, but unless the target site is the one-and-a-half-page catalogue of a local "Horns and Hooves" office, few will be frightened away by the extra cost.
      • Defence is always about modelling the typical behaviour of real visitors, plus systems that reliably identify "white" bots (Yandex, Google and the like). To imitate a real visitor, the parser needs a set of standard navigation maps, and then a simple proxy pool is no longer enough. Such a system does not protect 100%, but it does its job: from the viewing statistics you can tell when the whole site has been scanned. Either parsers or search engines do that - but search engines respect robots.txt, and parsers do not.
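    As a toy illustration of two of the cheap measures above - a blocklist of known parser IPs and a HoneyPot bait - here is a small Flask sketch. The IP list, routes and the bare 403 response standing in for a captcha page are all illustrative assumptions, not a recipe from the comments.

```python
# Toy defence sketch: IP blocklist plus a HoneyPot link (illustrative values only).
from flask import Flask, abort, request

app = Flask(__name__)

KNOWN_PARSER_IPS = {"198.51.100.23", "198.51.100.77"}   # e.g. loaded from a published list
TRAPPED_IPS = set()

@app.before_request
def screen_visitor():
    ip = request.remote_addr
    if ip in KNOWN_PARSER_IPS or ip in TRAPPED_IPS:
        abort(403)   # a real site would show a captcha page instead

@app.route("/catalog")
def catalog():
    # The honeypot link is hidden from humans; a naive bot following every link hits it.
    return '<a href="/honeypot" style="display:none">do not click</a><p>catalog...</p>'

@app.route("/honeypot")
def honeypot():
    TRAPPED_IPS.add(request.remote_addr)   # whoever lands here is treated as a bot
    abort(403)

if __name__ == "__main__":
    app.run()
```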

    "Oh wow. If everybody did everything wisely... I think there would be ten times more unemployed. There will be enough work for our lifetime."

    "Do I live ethically? Yes, but in vain"


    1. In the moral and ethical plane of the question lies an important point that touches both the technical and the legal aspects of parsing. The robots.txt file, laconic in form and symbolic in name, is interpreted by our readers (and by us) in different ways:

      • Your activity as the "driver" of a bot is exactly as "ethical" as your bot's compliance with the robots.txt of the site it visits - not on assumptions like "product pages are not closed anyway", but by literally applying the allow and disallow masks to the requested URLs (a small robots.txt check is sketched after this list). If robots.txt is missing, interpret that in your favour; if it is present and you violate it, you are definitely misusing the site. Of course robots.txt does not have the force of law, but if things really "catch fire", it is not a given that the lawyers will let it pass.
      • Even though you cannot negotiate with robots, it is sometimes easier than with people - after all, shops hang up "no photography" signs, which is both illegal and unethical. "It is simply a tradition. robots.txt is a technical thing; it is not about ethics. If you want to state that you do not want to be parsed, make a section like account.habr.com/info/agreement. I do not know whether such a restriction would be legally valid, but at least there you can express your wishes in human language (or mention robots.txt) - and then we can talk about ethics." Our lawyers retort: "There is no way such a restriction would be legally valid."
      • Let us think about parsing and about the further use of the information at the same time. "robots.txt is not so much about parsing as about further publication (in search results, for example). If you want nobody to get hold of the data, then limit the circle of people who can see it. If you have no curtains on your windows, you should not walk around naked. Deliberately peering into other people's windows may be ugly, but without curtains, what is there to complain about?"
      • Parsing ethics is neutral; what may be unethical is the use of the information obtained. "Purely from an ethical point of view, everyone has the right to receive public information that is not private or special in nature and is not protected by law. Prices are certainly public information, and so are descriptions - though descriptions may be subject to copyright and should not be republished without permission. But no ethics are violated even if I parse sites and build my own public site showing price dynamics and competitor comparisons. That is even ethical, since it provides socially useful information."
    2. "You may collect by hand, but you may not parse with a robot." With due diligence and skill any "evil" can be justified, and parsing all the more so - especially since there are live examples of it being used correctly in every sense. Quoting one reader: "I did parsing a long time ago, but I was always asked to do it in a completely legal and morally sound way. Several times intermediaries asked me to parse a wholesaler (to sell his goods); the wholesaler himself did not object, but was not going to invest in developing an API (or could not, for technical reasons). Once an intermediary of a Chinese store asked for an integration, but the Chinese store's API was so broken and limited that part of the info had to be obtained by parsing. Once the author and owner of a site and forum wanted to migrate off a free hosting service that was holding the database hostage. I also integrated a literary contest's site with its forum, so that adding a new story would automatically create a topic on the forum (for technical reasons it could not be done any other way)."
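    For the "literal" reading of robots.txt from the first bullet above, the Python standard library is enough. Here is a small sketch of applying a site's allow/disallow rules to a URL before requesting it; the domain and user-agent string are placeholders.

```python
# Sketch: check a URL against the site's robots.txt before fetching it (assumed URLs).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()                                      # download and parse robots.txt

url = "https://example.com/catalog/item-1"
if rp.can_fetch("price-monitor/0.1", url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows", url, "- skipping it")
```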

    "Did someone call a lawyer? Quoting is allowed, parsing is not"


    Whichever side you take on what the source of power is - money or truth - one thing is clear: where money starts to appear, finding the truth becomes harder and harder. Leaving the debate about whether absolutely everything can be bought, including the law and its representatives, outside the scope of this article, let us look at some legal aspects raised in the comments:

    1. "From peeping to theft is one step." Even if everything that is not forbidden is allowed, our readers believe that "peeking through the keyhole is at the very least ugly, and if the client then passes the parsed data off as his own, that is outright theft. Of course, everyone in business does this, but in decent society it is still customary to keep quiet about it." However, parsing for somebody and passing the parsed data off as your own are, as they say, two big differences: "You are confusing warm with soft. We really do provide parsing services. But by the same logic you could blame, say, weapons manufacturers for a murder. We run a business, and in business there is one rule: is it legal or not. My point is that if clients come to us and are willing to pay good money to get data, is that really so bad..."
    2. "Made an application for a media site - nailed for a complaint." Forbes site, parsing, application on Google Play - what could go wrong? “At one time I decided to make an application for the Forbes website. To get articles from the site - parsed pages. I configured everything in automatic mode and made an application for Android. I posted the application on the market. A year later, a lawyer contacted me and demanded to remove the application, because I violate copyrights. I did not argue. It's a shame that Forbes itself does not have an application for their own articles from the site. There is only a site. And their site is slow, loaded for a long time and hung with advertising ... "
    3. "My database is my work, and it is protected!" Copyright is another concept to which a dozen pages of discussion could be devoted (on top of the hundreds of thousands that already exist), but not mentioning it at all would also be wrong. One reader put it this way: "Someone created a database of goods. They spent a pile of resources on finding the information, systematising it and loading it into the database. At a competitor's request, you parse that database and hand it to that same competitor for money. Do you think there are no ethical problems here? As for the law, I do not know how it is in the Russian Federation, but in Ukraine a database can be subject to copyright."

      However, responsibility for how a service or product is used still lies with the one who acquires it and the purposes they use it for: "...in Russia too. We provide data collection services, and it is this service we charge money for - we do not sell the data itself. By the way, I warn all clients that they can break the law if they use, for example, the descriptions."
    4. "Formally you are right, but I have found an article of the law for you!" The Criminal Code of the Russian Federation (Article 146) only describes the scale of infringement at which copyright violation becomes a "criminal offence"; the rights themselves are described in the Civil Code. And regular parsing - heavy enough that the question "will the site hold up?" arises - can be stretched to that criminal scale without much trouble. But the details matter:

      • There, "large size" is not in the number of pages parsed, but in money. How do you rate parsing (and its regularity) as copyright infringement (!) In money? And how is it usually done in such cases and where can a fine of hundreds of thousands of dollars come from per copy of the film? The “lost profit” is calculated with the corresponding coefficient. You can calculate from some contracts - how much it will cost to buy the same information from you legally and from here “dance”. But, for starters, you should initially sell it (and not post it in the public domain), inventing a figure retroactively will not "drive" it. Although there are risks: do you know how much a commercial license for a conditional Consultant-Plus costs? As soon as you climb further than a dozen basic laws, you will quickly come across an offer to buy that very commercial version.
      • Our story is definitely not one for a criminal case (and do not confuse a fine with damages: you broke a beer bottle over a hooligan's head - the damage is 30 roubles, the fine is up to 1,000 roubles, and then you can sue for at least a trillion of "lost profit" in a civil suit, but that is no longer a fine). If you do not sell the prices at all, what is the expert going to write in the valuation? Be specific - not just "a good lawyer will pull it off without problems".

    Summing up: "- How did parsing come to equal copyright infringement? - It did not. The violation is to order parsing from us and then dump the content onto your own site. Taking the site down with the load is a different article altogether."

    Maxim Kulgin, xmldatafeed.com
