Classification of ads from social networks: in search of a better solution
I’ll tell you how text classification helped me find an apartment, why I gave up on regular expressions and neural networks, and how I ended up using a lexical analyzer.
About a year ago I needed to find an apartment to rent. Most ads from individuals are published on social networks, where they are written in free form and there are no search filters. Manually scanning publications in different communities is slow and inefficient.
At that time there were already several services that collected ads from social networks and published them on a website, so it was possible to see all the ads in one place. Unfortunately, they had no filters by ad type or price either. So after a while I decided to build my own service with the functionality I needed.
Text classification
First attempt (RegExp)
At first I tried to solve the problem head-on with regular expressions.
In addition to writing the regular expressions themselves, I had to post-process the results: take into account the number of matches and their positions relative to each other. Sentence handling was a particular problem: the text could not be reliably split into sentences, so it had to be processed as a whole.
As the regular expressions and the post-processing grew more complex, it became ever harder to raise the percentage of correct answers on the test set.
Regular expressions used in tests
- '/(комнат|\d.{0,10}комнат[^н])/u'
- '/(квартир\D{4})/u'
- '/(((^|\D)1\D{0,30}(к\.|кк|кв)|одноком|однуш)|(квартир\D{0,3}1(\D).{0,10}комнатн))/u'
- '/(((^|\D)2\D{0,30}(к\.|кк|кв)|двух.{0,5}к|двуш|двух.{5,10}(к\.|кк|кв))|(квартир\D{0,3}2(\D).{0,10}комнатн))/u'
- '/(((^|\D)3\D{0,30}(к\.|кк|кв)|тр(е|ё)х.{0,5}к|тр(е|ё)ш|тр(е|ё)х.{5,10}(к\.|кк|кв))|(квартир\D{0,3}3(\D).{0,10}комнатн))/u'
- '/(((^|\D)4\D{0,30}(к\.|кк|кв)|четыр\Sх)|(квартир\D{0,3}4(\D).{0,10}комнатн))/u'
- '/(студи)/u'
- '/(ищ.{1,5}сосед)/u'
- '/(сда|засел|подсел|свобо(ж|д))/u'
- '/(\?)$/u'
On the test set, this method gave 72.61% correct answers.
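For illustration, here is a minimal sketch in Go of what this kind of regex classification with post-processing might look like. The pattern names and the rule that combines match positions are hypothetical; the patterns themselves are taken from the list above, with the PCRE delimiters and the /u modifier stripped (Go's regexp package is UTF-8 by default).

```go
package main

import (
	"fmt"
	"regexp"
)

// A hypothetical subset of the patterns listed above, without the PCRE
// delimiters and the /u modifier.
var patterns = map[string]*regexp.Regexp{
	"room":   regexp.MustCompile(`комнат|\d.{0,10}комнат[^н]`),
	"studio": regexp.MustCompile(`студи`),
	"rent":   regexp.MustCompile(`сда|засел|подсел|свобо(ж|д)`),
}

// matchPositions returns, for each pattern, the byte offsets of its matches.
// The number of matches and their positions relative to each other then feed
// hand-written rules (not shown) that pick the final ad type.
func matchPositions(text string) map[string][]int {
	positions := make(map[string][]int)
	for name, re := range patterns {
		for _, loc := range re.FindAllStringIndex(text, -1) {
			positions[name] = append(positions[name], loc[0])
		}
	}
	return positions
}

func main() {
	fmt.Println(matchPositions("сдаю уютную комнату недорого"))
}
```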
Second attempt (Neural Networks)
Lately it has become fashionable to use machine learning for almost anything. After training, it is hard or even impossible to say why a network decided the way it did, but that does not prevent neural networks from being used successfully for text classification. For the tests I used a multilayer perceptron trained with backpropagation.
I used the following neural network libraries:
- FANN, written in C
- Brain, written in JavaScript
It was necessary to convert texts of different lengths so that they could be fed to a neural network with a fixed number of inputs. To do this, n-grams longer than 2 characters that occurred in more than 15% of the texts were extracted from the whole test set. There were slightly more than 200 of them (a selection sketch follows the examples below).
N-gram examples
- /ные/u
- /уютн/u
- /доб/u
- /кон/u
- /пол/u
- /але/u
- /двух/u
- /так/u
- /даю/u
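The selection step itself might look roughly like this in Go. The n-gram length range (3–4 characters) and counting each n-gram at most once per text are my assumptions; the article only fixes the "longer than 2 characters" and "more than 15% of texts" thresholds.

```go
package main

import "fmt"

// mineNgrams collects character n-grams longer than 2 characters (here 3 and
// 4) and keeps those that occur in more than 15% of the texts.
func mineNgrams(texts []string) []string {
	docFreq := make(map[string]int)
	for _, text := range texts {
		runes := []rune(text)
		seen := make(map[string]bool) // count each n-gram once per text
		for n := 3; n <= 4; n++ {
			for i := 0; i+n <= len(runes); i++ {
				seen[string(runes[i:i+n])] = true
			}
		}
		for gram := range seen {
			docFreq[gram]++
		}
	}
	threshold := float64(len(texts)) * 0.15
	var selected []string
	for gram, freq := range docFreq {
		if float64(freq) > threshold {
			selected = append(selected, gram)
		}
	}
	return selected
}

func main() {
	// With such a tiny corpus every n-gram clears the 15% bar; the real
	// selection ran over the whole test set.
	texts := []string{"сдаю уютную двушку", "сдаю комнату", "сдам студию"}
	fmt.Println(mineNgrams(texts))
}
```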
To classify a single ad, the n-grams were located in its text, their positions were determined, and this data was fed to the inputs of the neural network as values in the range from 0 to 1.
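The article does not spell out the exact encoding, so here is one plausible sketch: one input per selected n-gram, equal to the relative position of its first occurrence in the text scaled into (0, 1], or 0 if the n-gram is absent.

```go
package main

import (
	"fmt"
	"strings"
)

// vectorize builds a fixed-length input vector for the network: one value
// per selected n-gram, 0 when absent, otherwise the relative position of
// the first occurrence scaled into (0, 1]. This encoding is an assumption.
func vectorize(text string, ngrams []string) []float64 {
	inputs := make([]float64, len(ngrams))
	for i, gram := range ngrams {
		if idx := strings.Index(text, gram); idx >= 0 {
			inputs[i] = float64(idx+1) / float64(len(text))
		}
	}
	return inputs
}

func main() {
	ngrams := []string{"двуш", "сда", "студи"}
	fmt.Println(vectorize("сдаю двушку", ngrams))
}
```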
On the test set, this method gave 77.13% correct answers (though the tests were run on the same sample the network had been trained on).
I am sure that with a data set several orders of magnitude larger, and with recurrent networks, much better results could be achieved.
Third attempt (Parser)
Around the same time I started reading more articles about natural language processing and came across the wonderful Tomita parser from Yandex. Its main advantage over similar programs is that it works with Russian and has fairly clear documentation. You can use regular expressions in its configuration, which is very handy, since some had already been written.
Essentially this is a much more advanced version of the regex approach: far more powerful and convenient. But it, too, needed preliminary processing of the text. Posts that users write on social networks often violate the grammatical and syntactic norms of the language, so the parser has trouble with them: splitting the text into sentences, splitting sentences into tokens, and reducing words to their normal form.
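As a rough sketch of what such preprocessing might involve (the actual rules in rent-parser are more elaborate, and this pass is an assumption): collapse extra whitespace and turn line breaks into sentence boundaries, so the parser does not see the whole post as one run-on sentence.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var spaceRe = regexp.MustCompile(`[ \t]+`)

// preprocess collapses whitespace and terminates each non-empty line with a
// sentence-ending mark, giving the parser explicit sentence boundaries.
func preprocess(text string) string {
	var sentences []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(spaceRe.ReplaceAllString(line, " "))
		if line == "" {
			continue
		}
		if !strings.HasSuffix(line, ".") && !strings.HasSuffix(line, "!") && !strings.HasSuffix(line, "?") {
			line += "."
		}
		sentences = append(sentences, line)
	}
	return strings.Join(sentences, " ")
}

func main() {
	fmt.Println(preprocess("Сдаю двушку\nметро Девяткино\nцена 30000"))
}
```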
Configuration example
#encoding "utf8"
#GRAMMAR_ROOT ROOT
Rent -> Word;
Flat -> Word interp (+FactRent.Type="квартира");
AnyWordFlat -> AnyWord;
ROOT -> Rent AnyWordFlat* Flat { weight=1 };
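As far as I understand the Tomita grammar syntax, ROOT here matches a word recognized as Rent, followed by any number of other words (AnyWordFlat*), followed by a word recognized as Flat; the interp directive stores «квартира» into the Type field of the FactRent fact, and weight=1 sets the rule's priority when several parses compete.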
All configurations can be found here. On the test set, this method gave 93.40% correct answers. Besides classifying the text, the parser also extracts facts from it, such as the rental price, apartment area, metro station, and phone number.
Try the parser online
Request:
curl -X POST -d 'сдаю двушку 50.4 кв.м за 30 тыс в месяц. телефон + 7 999 999 9999' 'http://api.socrent.ru/parse'
Answer:
{"type":2,"phone":["9999999999"],"area":50.4,"price":30000}
Ad types:
0 - room
1 - 1 bedroom apartment
2 - 2 bedroom apartment
3 - 3 bedroom apartment
4 - 4+ bedroom apartment
In the end, given a small test set and the need for high accuracy, it proved more effective to write the algorithms by hand.
Service Development
In parallel with solving the text classification problem, I wrote several services to collect the ads and present them in a user-friendly form.
github.com/mrsuh/rent-view
The service responsible for displaying the ads. Written in NodeJS, using the doT.js template engine and the Mongo database.
github.com/mrsuh/rent-collector
The service responsible for collecting the ads. Written in PHP, using the Symfony3 framework and the Mongo database.
I wrote it expecting to collect data from many sources, but it turned out that almost all the ads are posted on the social network VKontakte. This network has a great API, so collecting ads from the walls and discussions of public groups was not difficult.
github.com/mrsuh/rent-parser
The service responsible for classifying the ads. Written in Golang, using the Tomita parser. In essence it is a wrapper around the parser, but it also handles the preliminary processing of the text and the post-processing of the parsing results.
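A minimal sketch of the wrapper idea, assuming the tomita-parser binary is invoked per request with its config file and exchanges text over stdin/stdout; the real service's invocation and I/O setup may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// parse feeds preprocessed ad text to the Tomita parser binary and returns
// its raw output for post-processing. The binary path, config name, and the
// stdin/stdout convention are assumptions for illustration.
func parse(text string) (string, error) {
	cmd := exec.Command("./tomita-parser", "config.proto")
	cmd.Stdin = bytes.NewBufferString(text)
	var out bytes.Buffer
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	facts, err := parse("сдаю двушку 50.4 кв.м за 30 тыс в месяц.")
	if err != nil {
		fmt.Println("parser error:", err)
		return
	}
	fmt.Println(facts)
}
```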
For all services, CI is set up with Travis-CI and Ansible (I described how automatic deployment is configured in this article).
Statistics
The service has been running for about two months for the city of St. Petersburg and in that time has collected a little more than 8,000 ads. Here are some interesting statistics on them for the whole period.
On average, 131.2 ads are added per day (more precisely, texts that were classified as ads).
The most active hour is 12 noon.
The most popular metro station is Devyatkino.
Conclusion: if you do not have a large sample to train a network on and you need high accuracy, it is best to use hand-written algorithms.
If anyone wants to tackle a similar problem themselves, here is the test set of 8,000 texts and their types.