21st Century Housing Search Without Intermediaries

    I suppose all of us have looked for a place to live at some point. Some people own their home, but most probably rent. Anyone who has tried even once to find genuine offers on classified-ad boards knows how hopeless it is: perhaps no other area has this much spam. After a dive into that hell, your hands usually start itching to put your IT skills to use for the benefit of others. For me the result was the Sobnik project, which I want to tell you about.

    Sobnik is a Chrome plugin that flags intermediaries on classified-ad boards. For now it works only with Avito.ru; in the near future I will add Irr.ru and other large boards. If you are sitting on your suitcases and eager to try it, head to the Chrome Web Store. Under the cut I will talk about the technical side of the project, its prospects, and my observations of the enemy, the intermediaries. Fans of criticizing someone else’s JS code are welcome too: the source of the client part of the plugin is available on GitHub.


    For sticklers for accuracy: formally, Sobnik is an “extension”, not a plugin, but getting used to the proper term is too painful for me.

    Why all this?


    “The benefit to society” is, I hope, obvious, so I will go straight to the question of why I personally need this. The last time I had to look for housing, after cursing the spam that has flooded the Internet and seeing more than enough of inventive realtors, I felt a distinct prick of conscience. After all, this is the 21st century, ships plow its expanses, and are we programmers really unable to deal with some miserable spammers?

    Upon reflection, I ventured to suppose that we are able. Viewing a few hundred ads was enough to see that intermediaries are easy to identify: either by the content of the ad, which is too suspicious or obviously written by an agent, or by many offers sharing the same phone number. What remained was to choose technologies for testing the idea: the ads had to be parsed, stored somewhere, and analyzed. I chose Google Chrome as the parser, because getting at all the necessary information on the boards requires a full browser engine with working JavaScript. For the server side I decided to try Go and MongoDB. All three were new to me, so it was a great opportunity to broaden my horizons. The result is Sobnik.

    How to identify agents?


    At first glance, quite simply. An accessible and reliable indicator is a phone number that appears in many ads. After all, an agent will not buy a new SIM card for every ad! In addition, some ads state outright that the author is a realtor and wants a commission. In theory this is simple, of course; in practice many small problems had to be solved:
    1. Avito and many other boards publish the phone number as an image, so the number has to be recognized.
    2. Agents actively hide their real phone numbers. The number is written in the ad text in words, letters, and special characters. All this camouflage has to be detected and decoded.
    3. Some owners post many ads for the same apartment. To avoid classing them as realtors, you have to work out whether different ads describe different properties or the same one. I did not bother with address recognition; I use the ready-made geographic coordinates available on many boards.
    4. The most advanced intermediaries draw their real phone numbers on the apartment photos. These are the hardest to identify. I did not find a reliable, easy-to-use OCR solution that can recognize numbers in photos. I had to improvise and come up with a simple algorithm that determines whether a photo contains any text at all, and treat such ads as agent ads.
    5. The ad text often says directly that the author is an agent. However, since computers have not yet learned to understand human language, I have not found a reliable way to make full use of this. For now I have settled for detecting a few of the most common and unambiguous phrases, so this criterion only complements the main detector, which works on phone numbers.
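
    Items 2 and 3 of the list can be illustrated with a minimal sketch. This is not Sobnik's actual code (the server is written in Go; Python is used here only for brevity). The names `extract_phone` and `find_agent_phones`, the English digit words, the threshold of three distinct properties, and the coordinate rounding are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: only English digit words are handled; real ads would need
# the board's own language and many more disguises than this.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
WORD_RE = re.compile(r"\b(" + "|".join(DIGIT_WORDS) + r")\b")
# Ten or eleven digits, each optionally followed by separator characters.
PHONE_RE = re.compile(r"(?:\d[\s\-()*._]*){10,11}")


def extract_phone(text):
    """Return the last 10 digits of a disguised phone number, or None."""
    plain = WORD_RE.sub(lambda m: DIGIT_WORDS[m.group(1)], text.lower())
    match = PHONE_RE.search(plain)
    if match is None:
        return None
    digits = re.sub(r"\D", "", match.group(0))
    return digits[-10:]  # drops a leading country prefix such as 8 or 7


def find_agent_phones(ads, min_objects=3, precision=3):
    """Flag phones that advertise at least `min_objects` distinct locations.

    Each ad is a dict with "phone", "lat" and "lon" keys (hypothetical schema).
    Ads with the same rounded coordinates count as one property, so an owner
    who reposts a single apartment many times is not flagged.
    """
    places = defaultdict(set)
    for ad in ads:
        spot = (round(ad["lat"], precision), round(ad["lon"], precision))
        places[ad["phone"]].add(spot)
    return {phone for phone, spots in places.items() if len(spots) >= min_objects}
```

    Counting distinct locations rather than ads is what keeps a persistent owner out of the agent list; the threshold itself would have to be tuned on real data.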

    These techniques make it possible to identify most intermediaries automatically. This is what Avito looks like while spammers are active (the red and green circles are Sobnik’s work):
    [image: Avito screenshot with Sobnik’s red and green circles]

    The technical side of the project


    The plugin is written in JavaScript, since the Chrome API is quite sufficient for the purpose. The only difficulty was getting the image with the phone number. Avito serves it only to requests with the correct Referer. There is no way to fake this header in the browser, and the same-origin policy does not allow reading the data of an image loaded by the Avito page. This protection turned out to be easy to get around: I save the page in MHTML format through the corresponding API, then cut out of the resulting string the piece I need, which contains the base64-encoded image. I get access to the apartment photos the same way.
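
    To show why the MHTML trick works: an MHTML file is an ordinary MIME multipart document in which every page resource, images included, is stored as a separate part, usually base64-encoded. The sketch below is an illustration in Python, not the plugin's JavaScript, and it parses the document with the standard `email` package instead of cutting the string by hand. The function name `extract_image` and the idea of matching on a fragment of the image URL are assumptions.

```python
import email
from typing import Optional


def extract_image(mhtml: str, url_fragment: str) -> Optional[bytes]:
    """Return the decoded bytes of the first image part whose
    Content-Location header contains `url_fragment`, or None."""
    message = email.message_from_string(mhtml)
    for part in message.walk():
        location = part.get("Content-Location", "")
        if part.get_content_maintype() == "image" and url_fragment in location:
            # decode=True undoes the Content-Transfer-Encoding (base64 here).
            return part.get_payload(decode=True)
    return None
```

    Because the browser itself fetched the image while rendering the page, the saved document already contains its bytes, and no second request with a forged Referer is needed.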

    Next, the information is sent to the server, where a program written in Go is running. In fact there are two programs, and all requests are handled asynchronously: one program writes every incoming request to a queue, and the second processes them. The client has built-in logic that slows down its calls to the server if the server cannot keep up. This approach should smooth out load spikes (I really hope there will be some today). The data is stored in MongoDB.
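
    The client-side slowdown could look roughly like the sketch below. The article does not describe the real logic, so everything here is an assumption: the class name `Throttle`, the doubling and halving of the delay, and the idea that the client learns from each response whether the server kept up.

```python
class Throttle:
    """Adaptive delay between client requests (illustrative only)."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def record(self, server_kept_up: bool) -> float:
        """Update and return the delay, in seconds, before the next request."""
        if server_kept_up:
            # Recover gradually, but never go below the base delay.
            self.delay = max(self.base_delay, self.delay / 2)
        else:
            # Back off, but never wait longer than the cap.
            self.delay = min(self.max_delay, self.delay * 2)
        return self.delay
```

    With many clients behaving this way, a burst of traffic lengthens the queue only briefly before the clients themselves reduce the inflow.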

    I hosted all of this on Amazon AWS (one more thing I wanted to try). So far the Free Tier is enough, so I pay nothing for hosting.

    The server API is publicly accessible, with no authorization. I suspect some people will want to fool around and play dirty, so I will introduce some protection in the near future. Ultimately I will almost certainly end up registering plugin users, but for now I do not want to put extra barriers in front of those who want to try it.

    The source code of the plugin is open. First, you cannot hide it anyway. Second, anyone can see at once what information the plugin collects, so knowledgeable people will have no questions about privacy. And finally, perhaps one day enthusiasts will turn up who want to take part in the development.

    There is no centralized crawler for data collection. First, Avito blocks IP addresses that open more than about a couple of hundred pages per hour. Second, I hope that once there are many users we will get a distributed crawler: everyone opens a couple of ads and the database fills up. However, while there are no active users, the database is empty. The main benefit of the plugin is that you do not need to open agent ads, but with an empty database you would have to open every one of them. So, to give the system at least some initial momentum, I made another plugin for internal use that quietly, at about a page per minute, scans apartment rental offers in Moscow on Avito. It cannot keep up with the spammers during peak hours, but nevertheless, dear reader, you have the chance to see how Sobnik works: install it, open the section, and enjoy. I would welcome suggestions on how to organize Avito scanning on a more serious scale. I can hand out the crawling plugin if you want to help the project or scan another city or section.

    Realtor Observations


    Running the rental scan in Moscow let me make some useful observations. All of them are quite logical and seem obvious, but Sobnik made it possible to check and confirm them with my own eyes:
    1. On business days, about 80% of the ads belong to agents. Avito, by the way, actively bans a lot of ads: of roughly 30 ads posted per minute, only about 10 are left an hour later. Still, the vast majority of those ten are intermediaries.
    2. Late in the evening (after 10 or 11 p.m.) and on weekends there are almost no agents. Apparently they are resting from the hard everyday work of spamming.
    3. Paid ads (highlighted in yellow on Avito) almost always belong to owners. So far I have seen only one agent who did not begrudge a hundred rubles to promote an elite apartment. It may well have been the owner, pretending to be an agent with an exclusive listing in order to make some extra money (judging by rumors, such people exist).
    4. If the ad has only one or two photos, it is almost certainly an agent. Three photos is 50/50. Owners either post with no photos at all or, if they make the effort, add at least five.
    5. If the phone number is shown on a photo or “encrypted” in the ad text, it is almost certainly an agent. They disguise it this way to hide from Avito, which charges for placing a large number of ads under the same phone number.

    This list alone lets you filter out almost all of the garbage by eye, so if you are too lazy to install Sobnik, use it.
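
    For those who prefer rules to prose, observations 3 to 5 can be written down as a tiny rule of thumb. This is only a restatement of the list, not part of Sobnik; the function name `eyeball_guess`, the order in which the rules are checked, and the "unclear" answer for cases the list does not cover (such as four photos) are assumptions.

```python
def eyeball_guess(photo_count, paid=False, phone_hidden=False):
    """Guess "agent", "owner" or "unclear" from what is visible in an ad.

    phone_hidden: the number is drawn on a photo or disguised in the text.
    paid: the ad is a paid (highlighted) one.
    """
    if phone_hidden:
        return "agent"
    if paid:
        return "owner"
    if photo_count in (1, 2):
        return "agent"
    if photo_count == 0 or photo_count >= 5:
        return "owner"
    return "unclear"  # three photos are 50/50; four are not covered by the list
```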

    Disclaimer: I have nothing against realtors. Avito has a special checkbox for them: tick it, and everyone immediately understands that you are an agent. And of course I am aware that in many cases an agent is simply necessary. Sobnik fights only those who spam and try to deceive you.

    Prospects


    I plan to develop the project in two directions:
    1. Add new boards (the next one will probably be From Hand to Hand).
    2. Improve the accuracy and reliability of the detector.

    In theory, once many boards are being actively scanned, Sobnik will be able to find the owner’s original ad among the copies of it that agents publish on other boards. Whether the project reaches those heights, time will tell, along with your valuable comments, of course.

    I do not plan to publish the collected database of ads; stealing and distributing this information would be too brazen. However, since Avito’s financial plan does not allow them to filter out spammers themselves, Sobnik will do it.

    I would be very glad to hear your wishes and suggestions.

    UPDATE:


    Since October 10, the problem of filling the database has been solved: the installed plugin automatically scans, in a separate tab, the ads that users currently need. In effect, Sobnik is now a large computing network in which every node works for the common cause. As a result, any not-yet-processed list of ads for any region is handled within a couple of minutes. Thanks to everyone who offered help, free servers, IPs, and Internet bandwidth; your willingness to help makes me incredibly happy. However, Sobnik now copes with this on its own.
