The architecture of the service for collecting and classifying housing ads from Vkontakte



    In this article I describe how the service for searching housing ads from Vkontakte is built and developed, why I chose a service-oriented architecture, and which technologies and solutions went into it.

    The service has been running for more than nine months.

    During this time:

    • It has expanded to cover 21 of the largest cities in Russia, including Moscow, St. Petersburg, Yekaterinburg and Kazan.
    • The total number of metro stations covered grew from 65 to 346.
    • The average number of ads grew from 131.2 to 519.41 per day.
    • A settings control panel was added.
    • Bots for Telegram and Vkontakte were added. They automatically notify subscribers of new ads.

    From here on, I use the word "service" to mean an SOA module, not the entire web service.

    I chose SOA because it made it possible to:

    • Use different technologies to solve different problems.
    • Develop each service independently of the others.
    • Deploy each service independently of the others.
    • Scale services horizontally.

    You could call it a microservice architecture, but there are slight differences. Instead of the HTTP API and per-service databases that are usual for microservices, the services exchange data through a "Common Database" using the MDBWP protocol. I chose this approach for development speed, while keeping all the advantages of SOA listed above.

    I chose Ansible to automate deployment. It is a configuration management system with a low entry threshold.

    MongoDB was selected as the database. This document-oriented database is well suited to storing ads together with a list of metro stations, the landlords' contact details and the ad description.
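
    To illustrate why a document model fits this data, here is a minimal sketch in Python of what one ad record might look like. Every field name is an assumption made for illustration and is not the project's actual schema. The point is only that the station list and contacts can live inside the ad itself, so a lookup by station needs no joins.

```python
# Hypothetical shape of one ad document; all field names are assumptions.
ad = {
    "city": "Saint Petersburg",
    "type": 2,                                   # ad type code (2-room apartment)
    "price": 30000,
    "description": "Renting out a two-room apartment near the metro.",
    "contacts": {"phones": ["9999999999"]},      # landlord contact details
    "subways": ["Devyatkino", "Grazhdansky Prospekt"],
}


def ads_near(ads, station):
    """Return the ads whose embedded station list contains the given station."""
    return [a for a in ads if station in a["subways"]]
```

    With a real MongoDB collection the same lookup would be a query on the embedded array field rather than a Python loop.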

    At the moment, the general scheme of interaction of services is as follows:



    Services:




    rent-view - a service for displaying ads and searching for them


    github.com/mrsuh/rent-view

    The service is written in NodeJS, because the most important quality criterion was the server's response speed to the user.

    The service queries MongoDB for ads, renders HTML pages with the doT.js template engine and serves them to the browser.

    The service is built with Grunt.

    The browser scripts are written in plain JS and the styles in LESS. Nginx serves as a proxy server: it caches some of the responses and provides the HTTPS connection.

    rent-collector - ad collection service


    github.com/mrsuh/rent-collector

    The service collects ads, classifies them and writes them to the database.

    It is written in PHP for two reasons: I knew the libraries needed to write the service, and development is fast.

    It uses the Symfony 3 framework.

    Beanstalk was chosen as the queue service. It is lightweight but has no message broker of its own. That is exactly what is needed for a small virtual server and for data whose loss is not critical. Using Beanstalk, I set up 4 message channels:



    • parser - extracts facts such as ad type, price, description and links from the text. To speed up data processing, I launched several consumers for this channel.
      Note: the consumer communicates with the rent-parser service.
    • collector - writes processed data about ads to the database.
    • notifier - notifies users of new ads.
      Note: the consumer communicates with the rent-notifier service.
    • publisher - publishes ads in several Vkontakte groups.
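
    The four channels above can be pictured as a small pipeline. The sketch below is in Python and uses in-memory queues instead of Beanstalk, so it shows only the flow of jobs and not the real service code. The order in which channels feed each other (parser, then collector, then notifier and publisher) and all function names are assumptions.

```python
import queue

# In-memory stand-ins for the four Beanstalk channels (an assumption for this sketch).
CHANNELS = ("parser", "collector", "notifier", "publisher")
channels = {name: queue.Queue() for name in CHANNELS}

database = []   # stand-in for MongoDB
notified = []   # stand-in for messages sent through rent-notifier
published = []  # stand-in for posts to Vkontakte groups


def parser_consumer(job):
    # Placeholder for the call to rent-parser; it only marks the job as parsed.
    job["parsed"] = True
    channels["collector"].put(job)


def collector_consumer(job):
    # Store the processed ad, then fan out to the two downstream channels.
    database.append(job)
    channels["notifier"].put(job)
    channels["publisher"].put(job)


def notifier_consumer(job):
    notified.append(job["text"])


def publisher_consumer(job):
    published.append(job["text"])


CONSUMERS = {
    "parser": parser_consumer,
    "collector": collector_consumer,
    "notifier": notifier_consumer,
    "publisher": publisher_consumer,
}


def drain():
    """Run every consumer in a loop until all channels are empty."""
    busy = True
    while busy:
        busy = False
        for name in CHANNELS:
            while not channels[name].empty():
                CONSUMERS[name](channels[name].get())
                busy = True
```

    Putting one job on the parser channel and calling drain() moves it through all four stages. In the real setup each consumer would be a separate process, which is what allows several parser consumers to run in parallel.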

    rent-parser - ad classification service


    github.com/mrsuh/rent-parser
    The service is written in Golang.

    To extract structured data from text, the service uses the Tomita parser from Yandex. The service itself pre-processes the text and post-processes the parsing results.

    I made an open API so that you can test the service.

    Try the parser online
    Request (the text reads "renting out a two-room apartment for 30 thousand a month. phone +7 999 999 9999"):
    curl -X POST -d 'сдаю двушку за 30 тыс в месяц. телефон + 7 999 999 9999' 'http://api.socrent.ru/parse'

    Response:

    {"type":2,"phone":["9999999999"],"price":30000}

    Ad types:
    + 0 - room
    + 1 - 1 room apartment
    + 2 - 2 room apartment
    + 3 - 3 room apartment
    + 4 - 4+ room apartment
    + 5 - studio
    + 6 - not an ad
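
    As a rough illustration of how a client might read this response, the Python sketch below decodes the JSON shown above and maps the type code to the labels from the list. The helper name describe is hypothetical, and treating "phone" and "price" as optional fields is an assumption, since only one sample response is shown.

```python
import json

# Ad type codes as listed above.
AD_TYPES = {
    0: "room",
    1: "1 room apartment",
    2: "2 room apartment",
    3: "3 room apartment",
    4: "4+ room apartment",
    5: "studio",
    6: "not an ad",
}


def describe(response_body):
    """Turn the parser's JSON response into a short human-readable summary."""
    data = json.loads(response_body)
    label = AD_TYPES.get(data["type"], "unknown type")
    if data["type"] == 6:
        return label
    phones = ", ".join(data.get("phone", []))
    return f"{label}, price {data.get('price')}, phone {phones}"
```

    For the sample response above, describe returns "2 room apartment, price 30000, phone 9999999999".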

    I wrote about ad classification in more detail here: habrahabr.ru/post/328282

    rent-control - settings management service


    github.com/mrsuh/rent-control



    It is written in PHP for two reasons: I knew the libraries needed to write the service, and development is fast.
    It uses the Symfony 3 framework.
    Styles come from the Bootstrap 3 library.

    The settings that the service controls include:

    • ads;
    • black list;
    • publication configurations;
    • parsing configurations.

    Initially, all the data that controls parsing lived in configuration files. As the number of cities grew, it became necessary to visualize this data and make records easier to edit, and also to make new parameters easier to add.

    rent-notifier - a bot service for sending new ads to Telegram and Vkontakte


    github.com/mrsuh/rent-notifier

    Example of subscribing to ads:

    The service is written in Golang because response speed to the user is critical.
    The idea of the service is simple: you subscribe to new ads, and as they are added the bot sends you messages about them. The service includes a link to the original ad in the message text.
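
    The subscribe-and-notify idea can be sketched as follows. This is a Python illustration and not the Golang service itself. The subscription criteria (city, ad types, maximum price, stations) and all names are assumptions. The only behavior taken from the text is that a matching subscriber gets a message containing a link to the original ad.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    chat_id: int
    city: str
    ad_types: set
    max_price: int
    subways: set = field(default_factory=set)  # empty set means any station


def matches(sub, ad):
    """Check whether a new ad satisfies a subscription's (hypothetical) criteria."""
    if ad["city"] != sub.city or ad["type"] not in sub.ad_types:
        return False
    if ad["price"] > sub.max_price:
        return False
    return not sub.subways or bool(sub.subways & set(ad["subways"]))


def build_messages(subs, ad):
    """Return (chat_id, text) pairs for every subscription the ad matches."""
    text = f"{ad['description']}\n{ad['link']}"
    return [(s.chat_id, text) for s in subs if matches(s, ad)]
```

    Sending the resulting pairs through the Telegram or Vkontakte bot API is left out, since it depends on the messenger.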

    Helper repositories




    Code for a shared database in PHP


    github.com/mrsuh/rent-schema

    General database schema:



    When the rent-control service was added, the database schema code became duplicated, so I moved it into a separate package. Now any PHP service only needs to add this package as a dependency through composer.

    composer require mrsuh/rent-schema


    ODM for MongoDB


    github.com/mrsuh/mongo-odm

    The first ODM for PHP and MongoDB that I thought of was Doctrine 2. It comes with Symfony 3 and has good documentation.

    But at the time I wrote the service, making this ODM work with the latest PHP drivers for MongoDB required installing another package as a layer between the new and old APIs. Doctrine 2 is itself a rather large project, and the additional package made it larger still. I wanted something lightweight instead, so I decided to write an ODM myself with a minimal feature set. It worked out: the ODM fully copes with its job.
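
    To show what a "minimal feature set" can mean for an ODM, here is a small Python sketch of the core job: mapping an object to a stored document and back. It is not the API of mrsuh/mongo-odm, which is a PHP package. The class Note, its fields and both function names are hypothetical.

```python
from dataclasses import dataclass, asdict, fields


@dataclass
class Note:
    id: str
    city: str
    price: int


def to_document(obj):
    """Convert an object to a plain dict, storing its id under MongoDB's _id key."""
    doc = asdict(obj)
    doc["_id"] = doc.pop("id")
    return doc


def from_document(cls, doc):
    """Build an object from a stored document, ignoring fields the class lacks."""
    data = dict(doc)
    data["id"] = data.pop("_id")
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})
```

    Everything beyond this mapping, such as query builders or change tracking, is what makes a full ODM large and is what a minimal one leaves out.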

    Some statistics




    The service adds an average of 519.41 ads per day to the site.

    The most popular metro stations in the largest Russian cities were:

    • Saint Petersburg - Devyatkino
    • Moscow - Komsomolskaya
    • Kazan - Prospekt Pobedy
    • Yekaterinburg - Uralmash
    • Nizhny Novgorod - Avtozavodskaya
    • Novosibirsk - Ploshchad Marksa
    • Samara - Moskovskaya

    More statistics can be found on the site itself.

    Conclusion




    If you have not yet decided whether you need SOA, build a monolithic application split into modules. That will make it easier to move the application to services later if necessary. But if you do decide to use SOA, keep in mind that it can increase development complexity, deployment complexity, the amount of code, and the number of messages between services.

    P.S. I found my last two apartments using my service. I hope it helps you too.
