Distributed Computing in the Search for Housing

    Everyone has heard of distributed computing projects that tackle large-scale tasks: searching for extraterrestrial life, drugs for AIDS and cancer, prime numbers, and unique Sudoku solutions. All of this is very entertaining, but nothing more, because it brings no practical benefit to the person who shares their computer's resources.

    Today I will talk about distributed computing that solves your own problems. Well, not all of them, of course, only a few related to the search for housing. I recently wrote about the Sobnik project, a Chrome extension that detects middlemen on classified ad boards. Two weeks ago a new version of the program was released, in which the work of scanning and analyzing ads is distributed across users' computers. Since then, about a million ads from more than a thousand Russian cities have been processed, and this is only the beginning. Details, technicalities, and a few more numbers await you under the cut.



    Why distributed computing?


    To identify a middleman, the program needs to download an ad, parse it, analyze the photos, store the results in a database, and run several kinds of searches, in particular by phone number. Initially the user had to open a specific ad for the plugin to perform this analysis. That is inconvenient: there are hundreds of ads, and almost all of them are posted by middlemen. Manual operations had to go, because computers exist to do things automatically!
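    To make the phone-number search concrete, here is a minimal Go sketch of the core idea: count how many ads share one phone number and flag numbers that post too much. The types, threshold, and field names are illustrative assumptions, not the project's actual algorithm.

```go
package main

import "fmt"

// Ad is a simplified classified listing; the real project stores far more.
type Ad struct {
	ID    string
	Phone string
}

// Detector counts how many ads share a phone number. A private owner
// typically posts one or two listings; a middleman posts dozens.
type Detector struct {
	adsByPhone map[string][]string // phone -> ad IDs seen so far
	threshold  int
}

func NewDetector(threshold int) *Detector {
	return &Detector{adsByPhone: make(map[string][]string), threshold: threshold}
}

// Add registers an ad and reports whether its phone number now looks
// like a middleman's (too many distinct ads for one number).
func (d *Detector) Add(ad Ad) bool {
	d.adsByPhone[ad.Phone] = append(d.adsByPhone[ad.Phone], ad.ID)
	return len(d.adsByPhone[ad.Phone]) >= d.threshold
}

func main() {
	det := NewDetector(3)
	for _, ad := range []Ad{
		{"a1", "+79000000001"},
		{"a2", "+79000000001"},
		{"a3", "+79000000001"},
	} {
		if det.Add(ad) {
			fmt.Printf("ad %s: phone %s looks like a middleman\n", ad.ID, ad.Phone)
		}
	}
}
```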

    The first plan was to build a centralized ad scanner (hereinafter, the crawler): take PhantomJS, adapt the plugin's JS code to it, and run the scan on the server. Avito, of course, would be angry at large-scale scanning and would block the server's IP (Avito is the only bulletin board the plugin works with), but in that case the plan was to buy IP addresses or use a pile of proxies. Yet the longer I worked and thought in this direction, the clearer it became that the solution was bad:
    1. Servers cost money that I didn’t want to spend - the project is free, and I dream of keeping it that way.
    2. No reliability - I had no desire to compete with Avito in the hunt for unbanned proxies.
    3. Scalability of such a solution rested on even more money, see point 1.

    Soon I came to the conclusion that the only solution free of the above drawbacks would be an automatic crawler built into the plugin, working in users' browsers even when they themselves are not busy searching for housing.

    Distributed crawler


    Only a small percentage of users actively work with Avito at any given moment, yet a browser with the plugin running is open almost all day. What if each user’s browser scanned, say, one ad per minute? Together, those capacities would be enough to quickly scan the ads that active users are currently working with. And Avito would have no reason to be upset, since no individual IP would ever be overloaded. Yes, probably not everyone would want to share their computer's resources, but I believed there would be enough “conscious citizens” anyway. The idea looked very tempting.

    Having delved into the technical part of the new approach, I realized that the current server simply could not cope with the increased load. Photo processing for detecting phone numbers was already consuming almost the entire CPU, and what if the flow of ads grew by an order of magnitude? The free virtual server from the AWS Free Tier had to cope, otherwise the life of the project was in jeopardy. So the first step was to move image processing to the scalable part of the system: the client. The plugin's code is open, and of course there were fears that publishing the algorithm would make it easier for evil spammers to find weaknesses and ways to cheat. But I saw no other options for the project's development, and decided: come what may.

    So now Sobnik opens a separate tab in your browser. Periodically, about once a minute, it requests a task from the server (the URL of an ad to scan), loads that ad in its tab, parses it, processes the images, and sends the results to the server. Scanning tasks are created when one of the users opens a list of ads; nothing is scanned ahead of time or stockpiled for the future, since ads are a perishable product.
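    To make the protocol concrete, here is a minimal Go sketch of what such a task-distribution server could look like: one handler enqueues ad URLs when a user opens a listing page, another hands a single task to a polling crawler. The endpoint paths, queue size, and JSON shapes are my assumptions, not Sobnik's actual API.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// tasks holds URLs of ads waiting to be scanned. A buffered channel
// gives non-blocking enqueue and dequeue semantics out of the box.
var tasks = make(chan string, 10000)

// enqueue is called when a user opens a list of ads: every ad URL on
// that page becomes a task for some other client's crawler tab.
func enqueue(w http.ResponseWriter, r *http.Request) {
	var urls []string
	if err := json.NewDecoder(r.Body).Decode(&urls); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	for _, u := range urls {
		select {
		case tasks <- u:
		default: // queue full; drop the task, ads are perishable anyway
		}
	}
}

// nextTask hands one ad URL to a polling client, or 204 No Content if
// there is nothing to scan right now.
func nextTask(w http.ResponseWriter, r *http.Request) {
	select {
	case u := <-tasks:
		json.NewEncoder(w).Encode(map[string]string{"url": u})
	default:
		w.WriteHeader(http.StatusNoContent)
	}
}

func main() {
	http.HandleFunc("/tasks/enqueue", enqueue)
	http.HandleFunc("/tasks/next", nextTask)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```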

    So far there are not that many conscious citizens willing to share their PC's resources: about a couple of hundred. But that is already enough to process 60 thousand ads per day, covering the needs of thousands of plugin users. A list of hundreds of ads opened on Avito is usually processed within a minute.
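    A quick sanity check of the arithmetic (the uptime figure is my inference, not a measured number): 200 clients at one ad per minute give 200 ads per minute, or 12,000 per hour, so 60 thousand ads per day corresponds to each participating browser being open for roughly five hours a day on average.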

    It all sounds pretty simple, but I'm still fascinated by the ability to use hundreds of other people's computers to solve a common problem. The browser is the ideal platform for this kind of application:
    1. JavaScript can do almost anything these days.
    2. JS code is cross-platform (apart from calls to the browser API).
    3. The extension installs in one click.
    4. Updates are distributed via the Web Store.
    5. The browser is running most of the day.

    If you are planning your own SETI@home, take a closer look: perhaps it is worth learning the API of your favorite browser. Much has already been written about this, but the heyday of distributed computing that solves the practical problems of its volunteers, rather than those of abstract human civilization, is still ahead.

    One nuance deserves attention: people get very annoyed when browser tabs open on their own. If you do need to open them, spend enough effort to make that behavior expected by users (instructions, an FAQ, clear informational messages). I take this opportunity to apologize to everyone who was surprised by the plugin's behavior after the new version was released; I should have spent more time informing you about such a significant change.

    Efficient server


    A server coordinating work in a distributed computing system must be very efficient, because it is not so easy to scale. Sobnik's server handles not only coordination but also the applied problem itself, searching the database, so the requirements are even higher. And while the client's functionality was only slightly extended, the server was almost completely redone (recall that it is written in Go).

    Initially, MongoDB was responsible for storing data and performing searches; now all operational data is kept in RAM. Two processes run on the server: a frontend and a backend. The frontend talks to clients and maintains a task queue for the distributed crawler, a queue of ads to hand over to the backend, and a small cache with ad statuses (middleman / owner). The backend keeps the data in RAM (evicting older ads as new ones arrive), searches it, asynchronously writes incoming ads to MongoDB, and asynchronously serves the frontend's requests. The processes communicate via ZeroMQ.
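    A minimal sketch of that frontend-to-backend link, assuming the pebbe/zmq4 Go bindings (the post does not say which bindings are used) and PUSH/PULL sockets: the frontend pushes serialized ads, the backend pulls them, and a DONTWAIT send never stalls the frontend if the backend lags.

```go
package main

import (
	"fmt"
	"log"
	"time"

	zmq "github.com/pebbe/zmq4"
)

// runBackend pulls serialized ads pushed by the frontend.
func runBackend() {
	pull, err := zmq.NewSocket(zmq.PULL)
	if err != nil {
		log.Fatal(err)
	}
	defer pull.Close()
	if err := pull.Bind("tcp://127.0.0.1:5555"); err != nil {
		log.Fatal(err)
	}
	for {
		ad, err := pull.Recv(0) // blocks until an ad arrives
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("backend got ad:", ad)
	}
}

func main() {
	go runBackend()

	push, err := zmq.NewSocket(zmq.PUSH)
	if err != nil {
		log.Fatal(err)
	}
	defer push.Close()
	if err := push.Connect("tcp://127.0.0.1:5555"); err != nil {
		log.Fatal(err)
	}

	// DONTWAIT makes the send non-blocking: if the backend's queue is
	// full, the call returns an error instead of stalling the frontend.
	if _, err := push.Send(`{"id":"a1","phone":"+79000000001"}`, zmq.DONTWAIT); err != nil {
		log.Println("backend busy, dropping ad:", err)
	}
	time.Sleep(time.Second) // give the demo backend time to print
}
```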

    MongoDB is used only in its most efficient modes: sequential writes of incoming ads, and a sequential single-stream read when the backend restarts. On startup, the backend reads the current ads from the DBMS, runs the same searches and analysis as when data arrives from clients, and fills its in-memory database. This approach makes it easy to modify the middleman-detection algorithm and apply it to the entire database with a simple backend restart.
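    Sketching that startup replay with the mgo driver (an assumption on my part: mgo was the usual Go MongoDB driver at the time; the database and collection names are invented):

```go
package main

import (
	"log"

	mgo "gopkg.in/mgo.v2"
)

// Ad mirrors what the backend keeps in RAM; the fields are illustrative.
type Ad struct {
	ID    string `bson:"_id"`
	Phone string `bson:"phone"`
	Text  string `bson:"text"`
}

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Sequential single-stream read of the whole collection: the only
	// read MongoDB ever serves here, done once at backend startup.
	ads := session.DB("sobnik").C("ads")
	inMemory := make(map[string]Ad)
	var ad Ad
	iter := ads.Find(nil).Iter()
	for iter.Next(&ad) {
		// The same analysis that runs on live client data would be
		// re-applied here, e.g. the phone-number counting shown earlier.
		inMemory[ad.ID] = ad
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
	log.Printf("replayed %d ads into RAM", len(inMemory))
}
```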

    While the backend is starting up and filling its in-memory database, the frontend keeps serving clients: it queues incoming ads, answers status requests from its cache, and hands out tasks from its queue.

    In a system with nearly a dozen parallel threads (goroutines) there is not a single mutex. Each goroutine has exclusive access to its own data and serves the others through channels (whose presence in Go is simply a delight).
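    This is the classic owner-goroutine pattern. A minimal sketch, assuming a status cache as the shared data: exactly one goroutine owns the map, and updates and queries reach it only through channels, so no mutex is ever needed.

```go
package main

import "fmt"

// statusQuery asks the cache owner for one ad's status and receives
// the answer on its private reply channel.
type statusQuery struct {
	adID  string
	reply chan string // "middleman", "owner", or "" if unknown
}

// cacheOwner is the only goroutine that ever touches the map, so no
// locking is required anywhere.
func cacheOwner(updates <-chan [2]string, queries <-chan statusQuery) {
	cache := make(map[string]string) // owned exclusively by this goroutine
	for {
		select {
		case u := <-updates:
			cache[u[0]] = u[1]
		case q := <-queries:
			q.reply <- cache[q.adID]
		}
	}
}

func main() {
	updates := make(chan [2]string)
	queries := make(chan statusQuery)
	go cacheOwner(updates, queries)

	updates <- [2]string{"a1", "middleman"}

	q := statusQuery{adID: "a1", reply: make(chan string)}
	queries <- q
	fmt.Println("a1 is:", <-q.reply) // no mutex anywhere
}
```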

    All interprocess and inter-thread interaction happens in non-blocking mode. If the receiving side cannot keep up, the data is either queued (incoming ads on the frontend) or discarded (other requests). All queues and the cache have size limits, and when they fill up the oldest entries are dropped: the freshest information is what matters most to Sobnik's clients. If a request cannot be served (the data channel is busy), the client gets a corresponding response and retries at increasing intervals.
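    Both overflow policies are easy to express in Go. A minimal sketch under the simplest possible data structures (not the project's actual code): a bounded queue that evicts its oldest entry, and a non-blocking send that reports "busy" instead of stalling.

```go
package main

import "fmt"

// boundedQueue drops its oldest entry when full: fresh ads matter more.
type boundedQueue struct {
	items []string
	limit int
}

func (q *boundedQueue) push(item string) {
	if len(q.items) == q.limit {
		q.items = q.items[1:] // evict the oldest entry
	}
	q.items = append(q.items, item)
}

// trySend implements "discard if the receiver is busy": a select with
// a default branch never blocks the sender.
func trySend(ch chan<- string, msg string) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false // channel busy; caller tells the client to retry later
	}
}

func main() {
	q := &boundedQueue{limit: 3}
	for _, ad := range []string{"a1", "a2", "a3", "a4"} {
		q.push(ad)
	}
	fmt.Println(q.items) // [a2 a3 a4]: a1 was evicted

	busy := make(chan string)        // unbuffered, and nobody is reading
	fmt.Println(trySend(busy, "a5")) // false: the send would have blocked
}
```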

    This architecture was built to ensure maximum reliability and availability of the service. So far it is paying off: with almost a million ads in the database and up to a million HTTP requests processed per day, the load average on the virtual server stays steadily below 10%.

    Searching for housing in a new way!


    Sobnik has successfully survived its rebirth: it became distributed and grew stronger. A month ago it was just an interesting idea, a prototype that many installed to play around with. Now it is a working tool that is convenient to use and really helps filter out spam. This became possible thanks to you, dear users, because it is your computers that are solving the common problem.

    Of course, the system is not perfect: the photo detector still raises many questions, and the analysis of ad text has its problems. However, it is now clear that Sobnik has enough resources to resolve these issues. It remains only to fuel the author's enthusiasm, which you can do with criticism, suggestions, and valuable comments.
