Theory of Loaders
Over the past 5 years I have written many loaders. That is what I call programs that parse content on source sites and save it to their own database. Often they are just a sequence of regular expressions that pick the values out of the right cells. Loaders can log in, connect through proxies, and sometimes even recognize CAPTCHAs. But that is not the point.
The theoretical problem is that a fully automatic loader cannot be written. We can pull in any information, but the database turns into a dump if the loader loses the source site's classification. And as soon as we try to preserve the classification, a problem arises.
Consider an example. Take a car site that aggregates car sale ads from hundreds of other resources. The loader parses an ad and returns an array:
{brand:"ford", model:"focus", modification:"1.6 Ti-VCT 5d", description: etc...}.
An automatic loader often works like this: it looks up the brand by name in the brands table; if “ford” is there, it takes the brand's id, and if not, it adds “ford” to the brands and takes the new id. It does the same with the model and the modification. Then it adds the ad with the ids it obtained. Such a system is bad because sooner or later there will be an ad with “FORD” instead of “ford”, or “AvtoVAZ” instead of “VAZ”, or some abbreviated or colloquial spelling instead of “St. Petersburg”. Smart Google will understand that these are synonyms, but our stupid loader, comparing names character by character, will not. The result is a mess in the classification tables.
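The failure mode described above can be shown in a few lines. This is only an illustrative sketch in Python (the loaders discussed here are PHP); the function name `get_or_create` and the in-memory dict standing in for the brands table are assumptions made for the example.

```python
def get_or_create(table: dict[str, int], name: str) -> int:
    """Naive lookup: exact, character-by-character match on the name."""
    if name not in table:
        table[name] = len(table) + 1  # unseen spelling gets a brand-new id
    return table[name]


# A dict stands in for the brands table (name -> id).
brands: dict[str, int] = {}
for raw in ["ford", "FORD", "VAZ", "AvtoVAZ"]:
    get_or_create(brands, raw)

# Every spelling became its own row, although a human sees only two brands.
print(brands)
```

Normalizing case would merge “ford” and “FORD”, but it would not help with “VAZ” versus “AvtoVAZ”, which is why the rest of the post relies on a human-maintained mapping.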
Trying to minimize the moderator's manual labor, I came up with the following algorithm.
First of all, the loader consists of two parts.
The first is loader_pages.
The script scans pages with lists of ads, such as http://cars.auto.ru/cars/used/ford/focus/, and simply collects links to individual ads. It also finds pagination links and recurses through them. When it finds a link to an ad, it adds it to the database or, if the link is already there, updates its “last found date” to the current time. Since the loader runs hourly, this lets us delete objects whose last found date is fairly old: the link was not found again, which means the object has been removed from the source.
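The bookkeeping part of loader_pages (add or refresh a link, then drop stale ones) might look roughly like the sketch below. It uses Python's standard sqlite3 module with an in-memory database; the table layout, the function names, the example.com URLs and the 3-hour threshold are all assumptions for illustration, and the page crawling itself is left out.

```python
import sqlite3
from datetime import datetime, timedelta


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE links ("
        " url TEXT PRIMARY KEY,"
        " last_found TEXT NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def record_links(db: sqlite3.Connection, urls: list[str], now: datetime) -> None:
    """Add each link, or refresh its last-found date if it is already known."""
    ts = now.isoformat(timespec="seconds")
    for url in urls:
        cur = db.execute("UPDATE links SET last_found = ? WHERE url = ?", (ts, url))
        if cur.rowcount == 0:
            db.execute("INSERT INTO links (url, last_found) VALUES (?, ?)", (url, ts))


def delete_stale(db: sqlite3.Connection, now: datetime, max_age: timedelta) -> int:
    """Delete links not seen for longer than max_age; return how many were removed."""
    cutoff = (now - max_age).isoformat(timespec="seconds")
    return db.execute("DELETE FROM links WHERE last_found < ?", (cutoff,)).rowcount


db = open_db()
t0 = datetime(2024, 1, 1, 12, 0)
record_links(db, ["http://example.com/ad/1", "http://example.com/ad/2"], t0)

t1 = t0 + timedelta(hours=5)
record_links(db, ["http://example.com/ad/1"], t1)  # ad/2 is no longer listed
removed = delete_stale(db, t1, timedelta(hours=3))
remaining = [row[0] for row in db.execute("SELECT url FROM links")]
```

How old is “fairly old” is a tuning decision: a threshold of several runs rather than one avoids deleting ads because of a single failed crawl.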
The second is loader_offer.
It takes links that have not yet been processed from the database, loads the HTML, and parses it. The result is an array like
{brand:"ford", model:"focus", modification:"1.6 Ti-VCT 5d", description: etc...}
It then loads the compares table. This table holds the comparisons that the moderator will process manually. It consists of the fields:
{loader, type, found value, id in the corresponding classification table}.
In our case,
{loader:"auto.ru", type:"brand", value:"ford", mapping:"..."}.
If the corresponding comparison has already been made, hooray, we take the id. If not, we add a new comparison to compares but do not add the object.
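The lookup rule above can be sketched as follows. This is a hedged Python illustration, not the author's PHP code: the `Compare` class, the `resolve` and `load_offer` functions, the English field names and the numeric ids are all hypothetical, and a dict stands in for the compares table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Compare:
    loader: str
    kind: str                        # "brand", "model", "modification", ...
    value: str                       # the value as found on the source site
    mapped_id: Optional[int] = None  # filled in later by the moderator


# The compares table, keyed by (loader, kind, value).
Compares = dict[tuple[str, str, str], Compare]


def resolve(compares: Compares, loader: str, kind: str, value: str) -> Optional[int]:
    """Return the mapped id, queuing a new pending comparison if none exists yet."""
    key = (loader, kind, value)
    if key not in compares:
        compares[key] = Compare(loader, kind, value)
    return compares[key].mapped_id


def load_offer(compares: Compares, loader: str, parsed: dict[str, str]) -> Optional[dict[str, int]]:
    """Return classification ids for the ad, or None if any comparison is pending."""
    ids = {k: resolve(compares, loader, k, parsed[k])
           for k in ("brand", "model", "modification")}
    if any(v is None for v in ids.values()):
        return None  # do not add the object until the moderator has caught up
    return ids


compares: Compares = {}
ad = {"brand": "ford", "model": "focus", "modification": "1.6 Ti-VCT 5d"}

first = load_offer(compares, "auto.ru", ad)  # nothing mapped yet: ad is skipped

# The moderator fills in the mappings (ids here are made up).
compares[("auto.ru", "brand", "ford")].mapped_id = 12
compares[("auto.ru", "model", "focus")].mapped_id = 340
compares[("auto.ru", "modification", "1.6 Ti-VCT 5d")].mapped_id = 5011

second = load_offer(compares, "auto.ru", ad)  # now the ad can be saved
```

Because the link stays unprocessed when `load_offer` returns None, the ad is simply picked up again on a later run once the moderator has filled in the missing mappings.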
The moderator goes through the comparisons that have not yet been filled in and matches their values against the corresponding “good” tables of car brands, models, cities, etc.
Parents.
Everything works well while the tables are small. Take car brands: there are only 100 of them, and matching them is trivial. But my database has 7,000 models and 20,000 modifications. Imagine picking, out of 20 thousand, the match for the modification “1.6 Ti-VCT 5d”, which in my tables is called “1.6 Ti-VCT”. The moderator would be worn out. Or you need a good search.
But it can be made easier. When loading ads, we process the comparisons in order: first the brand, then the model, then the modification. We take the comparison for the brand,
{loader:"auto.ru", type:"brand", value:"ford", mapping:"..."},
whether we find it or add it does not matter. We take the id of this comparison and write it into an additional parent field of the model comparison:
{loader:"auto.ru", type:"model", value:"focus", mapping:"...", parent:"id of the brand comparison"}.
We do the same for the modification, writing the id of the model comparison into its parent.
The moderator works in order. First he takes the brand comparisons and fills them all in. Then he takes a model comparison. We can see that it has a parent brand comparison that is already filled in, so the options offered should be not all possible models but only those whose brand matches the value of that parent comparison. In other words, once Ford has been filled in, Focus is chosen not out of 7,000 models but only out of the hundreds of Ford models.
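The narrowing of options through the parent comparison can be sketched like this. It is an illustrative Python example under assumed data: the tiny `MODELS` table, the row ids and the `model_candidates` function are hypothetical, and the field names are English renderings of the ones used in the arrays above.

```python
from typing import Optional

# Hypothetical "good" models table: model id -> (name, brand id).
MODELS: dict[int, tuple[str, int]] = {
    10: ("Focus", 1),
    11: ("Fiesta", 1),
    20: ("2107", 2),
}

# Rows of the compares table keyed by their own id.
compares: dict[int, dict] = {
    100: {"loader": "auto.ru", "type": "brand", "value": "ford",
          "mapped_id": 1, "parent": None},
    101: {"loader": "auto.ru", "type": "model", "value": "focus",
          "mapped_id": None, "parent": 100},
}


def model_candidates(compares: dict[int, dict],
                     models: dict[int, tuple[str, int]],
                     compare_id: int) -> dict[int, str]:
    """Models to offer the moderator for one pending model comparison."""
    parent: Optional[dict] = compares.get(compares[compare_id]["parent"])
    if parent is None or parent["mapped_id"] is None:
        # Parent brand not filled in yet: fall back to the full list.
        return {mid: name for mid, (name, _brand) in models.items()}
    brand_id = parent["mapped_id"]
    return {mid: name for mid, (name, b) in models.items() if b == brand_id}


candidates = model_candidates(compares, MODELS, 101)
```

The same function shape applies one level down: a modification comparison would look at its parent model comparison and offer only that model's modifications.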
The point of this post is not that I came up with something completely new. I have simply never seen a description of such programs anywhere. And what I like is precisely the excessive practicality, because in principle it is clear that each object is a subset of the vertices of some trees, and a parser is a mapping from the HTML elements of a page to those vertices. One could build a theory around this, something like a language for describing parsers, and so on. On the other hand, my average PHP loader takes 2 pages of code. And it is not clear whether the theory is worth the trouble, because I cannot see how to shorten and simplify that code any further, even with some abstract language.