Two years with crawlers (web-mining)

    Disclaimer: this post may come across as partly self-promotion, filler and nonsense, but it is really just an attempt to classify the information and experience gained over two years of work in scraping, for myself and for anyone who is interested.

    I am not chasing karma; I have enough of it.

    Under the cut is a short post about the current market of crawlers / parsers, with a classification and its peculiarities.

    Subject


    We are talking about "spiders", i.e. programs that collect information on the Web. Spiders come in different flavors: most crawl the web, some parse torrents, some fido / ed2k and other curiosities. The essence is the same: deliver the information the customer needs, in a form that is convenient for them.

    Unfortunately, S. Shulga ( gatekeeper ) overestimated this industry: information mining is in demand, but AI technology is barely used in it, and it is a long way from automated advisers. By and large, spiders fall into a few categories that differ in the complexity of the methods used.

    Classification


    Simple crawlers

    Cheap, simple scripts, usually in PHP. The task is to crawl the site page by page and save prices, attributes and photos to a database, possibly with some processing. You can look up the cost of such projects on freelance boards; it is usually laughable. These are mostly one-off projects, and they get banned by IP or by request rate.
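
    As a rough illustration, a minimal sketch of such a crawler in Python, assuming a hypothetical shop with a predictable URL scheme; the URL, XPath expressions and table layout are invented for the example:

        # Sketch of a "simple crawler": walk a list of product pages sequentially
        # and store title/price in SQLite.
        import sqlite3
        import time

        import requests
        from lxml import html

        db = sqlite3.connect("products.db")
        db.execute("CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, title TEXT, price TEXT)")

        urls = ["https://example-shop.test/product/%d" % i for i in range(1, 101)]  # hypothetical source

        for url in urls:
            resp = requests.get(url, timeout=30)
            tree = html.fromstring(resp.content)
            title = tree.xpath("string(//h1)")                  # assumed page structure
            price = tree.xpath("string(//*[@class='price'])")   # assumed page structure
            db.execute("INSERT OR REPLACE INTO products VALUES (?, ?, ?)", (url, title, price))
            db.commit()
            time.sleep(1)  # naive politeness delay; exactly the pattern that gets banned by request rate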

    Group crawlers

    I implemented a project like this for cenugids.lv. Here many (50+) crawlers share the same code base; more precisely, it is one crawler with adapters for several sources (for cenugids.lv those were shops). It is mainly used to collect information from similar sources (forums, shops).
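
    The idea in sketch form: one shared engine, many thin per-source adapters. The class and method names below are illustrative, not the actual cenugids.lv code:

        # One engine, many small Source subclasses: adding shop number 51
        # means writing one more adapter, not one more crawler.
        import requests
        from lxml import html


        class Source:
            """Adapter describing one shop: where to start and how to read a page."""
            start_urls: list

            def extract_items(self, tree):
                raise NotImplementedError


        class ShopA(Source):
            start_urls = ["https://shop-a.test/catalogue"]  # hypothetical

            def extract_items(self, tree):
                for node in tree.xpath("//div[@class='item']"):    # assumed markup
                    yield {"name": node.xpath("string(.//h2)"),
                           "price": node.xpath("string(.//span[@class='price'])")}


        def crawl(source: Source):
            # Shared fetching/parsing/storage lives here once for all sources.
            for url in source.start_urls:
                tree = html.fromstring(requests.get(url, timeout=30).content)
                for item in source.extract_items(tree):
                    print(item)  # in reality: write to the database


        crawl(ShopA())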

    Behavioral Crawlers

    This means disguising the bot as a human. The customer usually asks for a specific behavior strategy: collect the information only at lunchtime, at 2 pages per minute, 3-4 days of the working week, for example. The spec may even include a break for a "vacation" and a change of the "browser version" in step with real releases.
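
    A minimal sketch of such a schedule in Python, assuming a "lunchtime only, Monday to Thursday, about 2 pages per minute" strategy; the exact windows are just an example:

        # Behavioral throttling: only crawl inside the allowed window,
        # and pace requests with a bit of jitter.
        import random
        import time
        from datetime import datetime


        def may_crawl_now() -> bool:
            now = datetime.now()
            working_day = now.weekday() < 4      # Mon-Thu, "3-4 days a week"
            lunchtime = 12 <= now.hour < 14      # only around lunch
            return working_day and lunchtime


        def fetch(url):
            ...  # the real download goes here


        def polite_loop(urls):
            for url in urls:
                while not may_crawl_now():
                    time.sleep(300)              # wait for the next allowed window
                fetch(url)
                time.sleep(random.uniform(25, 35))   # roughly 2 pages per minute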

    Caching Crawlers

    Technically the most cumbersome solution, used to scrape something the size of eBay. It usually consists of several parts. One part picks out the places in the source that are worth visiting (for a store, these are categories and page listings). It runs quite rarely, because that information is fairly stable. Then, at random intervals, the spider walks through the "interesting places" and collects links to the data itself (for example, products). Those links are in turn processed with random delays and written to the database.

    This process is not periodic; it runs continuously. In parallel, old links are rechecked: say, every 5 minutes we pick 10 cached products from the database and check whether they are still alive and whether the price or attributes have changed.

    With this, technically the most cumbersome, solution the customer gets not a snapshot of the source at some moment, but reasonably fresh data from the crawler's own database. With the date of the last update, naturally.
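
    For illustration, a simplified sketch of the re-validation half of such a crawler: every few minutes it pulls the stalest cached items and checks whether they are still alive and whether the price changed. Table, column and helper names are invented:

        # Continuous recheck loop of a caching crawler (sketch).
        import random
        import sqlite3
        import time

        import requests

        db = sqlite3.connect("cache.db")
        db.execute("CREATE TABLE IF NOT EXISTS items "
                   "(url TEXT PRIMARY KEY, price TEXT, alive INTEGER, checked_at REAL)")


        def extract_price(page_text: str) -> str:
            # hypothetical helper: in reality an XPath or regex over the product page
            return ""


        def recheck_batch(batch_size: int = 10):
            rows = db.execute(
                "SELECT url, price FROM items ORDER BY checked_at ASC LIMIT ?", (batch_size,)
            ).fetchall()
            for url, old_price in rows:
                resp = requests.get(url, timeout=30)
                if resp.status_code == 404:
                    db.execute("UPDATE items SET alive = 0, checked_at = ? WHERE url = ?",
                               (time.time(), url))
                else:
                    new_price = extract_price(resp.text)
                    db.execute("UPDATE items SET price = ?, alive = 1, checked_at = ? WHERE url = ?",
                               (new_price, time.time(), url))
                db.commit()
                time.sleep(random.uniform(5, 20))   # random delays between checks


        while True:
            recheck_batch(10)
            time.sleep(300)                         # the "every 5 minutes" from above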

    Problems and Methods



    Detection

    It is easy enough (at least by looking at the statistics) to notice that your site is being scraped: a number of requests equal to the number of pages is about as conspicuous as it gets. This is usually worked around with a caching crawler and by spreading the requests out on the traffic graphs. And naturally, your requests must not stand out against the target site's own traffic.

    IP Ban

    The simplest thing you run into at the start of a war with the site administrator. The first way out is proxies. The downside: you have to maintain your own infrastructure, keep the proxy list fresh, hand it over to the customer, and make sure it does not all go down at once. For one-off orders this, of course, falls away. That said, it took me about a week to build such an infrastructure together with the interfaces.
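
    A small sketch of rotating through a proxy list with retries; the addresses are placeholders, and in practice the list has to be refreshed and dead proxies weeded out:

        # Round-robin proxy rotation with simple retry-on-failure.
        import itertools

        import requests

        PROXIES = [
            "http://203.0.113.10:3128",   # placeholder addresses
            "http://203.0.113.11:3128",
            "http://203.0.113.12:3128",
        ]
        _pool = itertools.cycle(PROXIES)


        def fetch_via_proxy(url: str, attempts: int = 3) -> requests.Response:
            last_error = None
            for _ in range(attempts):
                proxy = next(_pool)
                try:
                    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=20)
                except requests.RequestException as exc:
                    last_error = exc      # proxy is dead or too slow, rotate on
            raise last_error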

    The second option is Tor. A great anonymization network, with a convenient interface where you can pick the desired country and exit node. Speed, with caching solutions, is not really an issue. It performs well enough: I still have one client banning all the exit nodes, his iptables rules already number over 9000 (9873 at the time of writing), and still to no avail...
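
    For reference, a minimal sketch of sending requests through a locally running Tor client (its SOCKS proxy listens on 127.0.0.1:9050 by default; requests needs the PySocks extra for socks5h URLs). Pinning the country or exit node is configured in torrc, e.g. via the ExitNodes option, and is not shown here:

        # Route HTTP requests through a local Tor SOCKS proxy.
        import requests

        TOR_PROXY = {
            "http": "socks5h://127.0.0.1:9050",    # socks5h: resolve DNS through Tor as well
            "https": "socks5h://127.0.0.1:9050",
        }

        resp = requests.get("https://check.torproject.org/", proxies=TOR_PROXY, timeout=60)
        print(resp.status_code)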

    Registration / Authorization

    A problem that becomes trivial as you gain experience. Log in, save the cookies, come back with them, parse. CAPTCHAs get broken just as routinely.
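
    A sketch of the "log in, keep the cookies, parse" routine with a persistent session; the login URL and form field names are made up:

        # Authenticate once, persist the cookies, reuse the session for parsing.
        import pickle

        import requests

        session = requests.Session()
        session.post(
            "https://example-forum.test/login",                  # hypothetical login endpoint
            data={"username": "spider", "password": "secret"},   # hypothetical form fields
        )

        # Save cookies so the next run does not have to log in (and solve a CAPTCHA) again.
        with open("cookies.pkl", "wb") as f:
            pickle.dump(session.cookies, f)

        page = session.get("https://example-forum.test/private/topic/1")
        # ...parse page.text as usual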

    Going to infinity

    A parser can go off the rails if the site manages to generate an infinite number of links. For example, an osCsid (osCommerce session id) or PHPSESSID appended each time can make the crawler treat the same link as new. I have seen stores that generated pseudo-random links on every refresh (so for search engines one product lived at 50+ different URLs). Finally, bugs in the source can also produce an unbounded number of links (for example, a store that kept showing a "next" link and 5 more pages ahead of the current one even somewhere around empty page 7000+).
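
    One common defense is URL normalization: strip session parameters such as PHPSESSID / osCsid and deduplicate on the cleaned form. A sketch, with a site-specific parameter blacklist:

        # Canonicalize URLs before deciding whether a link is "new".
        from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

        SESSION_PARAMS = {"phpsessid", "oscsid", "sid"}   # extend per source

        def normalize(url: str) -> str:
            parts = urlparse(url)
            query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in SESSION_PARAMS]
            return urlunparse(parts._replace(query=urlencode(query), fragment=""))

        seen = set()

        def should_visit(url: str) -> bool:
            key = normalize(url)
            if key in seen:
                return False
            seen.add(key)
            return True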

    Encodings

    Oddly enough, the biggest problem is encodings. cp1251? HTML entities? FIVE kinds of "spaces" in the Unicode table? And what if the customer wants XML, and a single bad character kills simplexml dead?

    I am too lazy to list every encoding pitfall. I will simply say that in my crawler, encoding handling makes up almost exactly half of the data post-processing.
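
    To give a flavor of it, a sketch of the kind of cleanup involved: decode cp1251 when the page lies about its charset, unescape HTML entities, collapse the zoo of Unicode spaces and drop the control characters that kill XML parsers. The character lists here are only a sample:

        # Typical post-processing pass over scraped text.
        import html
        import re

        UNICODE_SPACES = re.compile(r"[\u00a0\u2007\u2009\u202f\u200b]")   # a few of the usual suspects
        XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")          # control chars XML chokes on

        def clean(raw: bytes, declared_encoding: str = "utf-8") -> str:
            try:
                text = raw.decode(declared_encoding)
            except UnicodeDecodeError:
                text = raw.decode("cp1251", errors="replace")   # common fallback for ex-USSR shops
            text = html.unescape(text)                          # &nbsp; &amp; and friends
            text = UNICODE_SPACES.sub(" ", text)
            return XML_ILLEGAL.sub("", text)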

    Platform


    People love PHP. Usually PHP + simplexml, or PHP + DOM / XPath. XPath is indispensable in general, but PHP setups have two big drawbacks: they eat memory and they fall over. 512 megabytes per crawler is normal once mbstring is involved, not to mention a core dump from simply trying to add one more tag to an XML document. On small sites this goes unnoticed, but when 50+ megabytes are pulled from the source in one go... So, for the most part, the serious players leave PHP behind.

    My choice is Python. Besides XPath, there are libraries for ed2k, Kazaa, torrents, any database, plus excellent string handling, speed, stability and OOP. On top of that, the ability to embed a small server right into the crawler for delivering data to the customer lets you keep extra software off the server and stay inconspicuous: for example, if you failed to pick up the output within 15 minutes after midnight, you have only yourself to blame.
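
    A toy sketch of that embedded mini-server idea: serve the latest export over plain HTTP so the customer pulls the file themselves; the port and file name are arbitrary:

        # Minimal embedded delivery server built into the crawler process.
        from http.server import BaseHTTPRequestHandler, HTTPServer


        class ExportHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                try:
                    with open("export.xml", "rb") as f:   # produced by the crawler earlier
                        body = f.read()
                except FileNotFoundError:
                    self.send_error(404, "no export yet")
                    return
                self.send_response(200)
                self.send_header("Content-Type", "application/xml")
                self.end_headers()
                self.wfile.write(body)


        HTTPServer(("0.0.0.0", 8081), ExportHandler).serve_forever()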

    Conclusion


    If anyone is interested, I can describe in a separate article methods for breaking CAPTCHAs, bypassing user-agent protection, analyzing server responses, and parsing non-web sources. Got questions? Welcome to the comments or PM!
