How users teach Yandex to warn about telephone spam

    Phone spam is familiar to anyone who has posted their number on the Internet, filled out a dubious paper form, or simply had the bad luck to end up in one of the many databases. Today we will tell Habrahabr readers how we used user feedback and machine learning to teach the Yandex application to warn about unwanted calls.



    A call from an unfamiliar number always forces a difficult choice. Is it the long-awaited courier or yet another operator with a "unique" advertising offer? Some mobile applications tackle this with directories of well-known organizations, and they solve part of the problem. But the most aggressive spammers, dubious debt collectors and scammers never make it into such databases. What to do?

    The idea of creating our own caller ID came to us by chance. We noticed a colleague who carried two phones. When an unfamiliar number called the main phone, he typed that number into a search engine on the second device and looked for reviews online. The method can hardly be called convenient, but it inspired us to automate it a little. We assembled the first prototype for Android: during an incoming call it opened a window with a webview that loaded the search results for the caller's number. Excellent! We managed to save one phone. Seriously, though, while the routine got simpler, the benefit was small.

    Try entering any phone number into a search engine. You are guaranteed to find sites that hint they have reviews of the number. But click on a result and in most cases it turns out that the site has simply generated pages for every possible number, with no actual reviews. Looking up information during an incoming call this way is too slow and ineffective. The only good solution is to show the answer right away. But that requires data.



    Yandex has the Directory, a knowledge base about organizations that is updated by both companies and users. Information about organizations shown in Search or Maps comes from there. When our internal caller ID prototype for mobile devices first switched from plain search results to verdicts, it pulled the data from the Directory. But this was not enough: too many calls came from numbers whose owners do not advertise which company they belong to. To overcome this, we also needed to collect feedback from users who had received calls from these numbers.

    We started simple. Since last summer, Yandex Search has invited users to leave feedback on the phone number they are searching for, using a plain text box for a free-form review. We did not restrict the response to predefined options because we did not yet have a full picture of the variety of sources of unwanted calls. The problem is that parsing free-form reviews is hard to automate. We got around this with the Toloka crowdsourcing platform, whose users helped us parse and classify the responses.



    So we started collecting data not only about well-known organizations with a relatively good reputation, but also about spammers, scammers, aggressive debt collectors, pranksters and even callers who stay silent on the line. Not every category can safely be classed as unwanted, though. For example, calls from courier services are usually useful.
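
    How several crowd judgments on one free-form review might be turned into a single category can be sketched as follows. This is only an illustration: the article does not say how the labels were aggregated, so the majority vote, the thresholds `min_votes` and `min_share`, and all category names here are assumptions.

```python
from collections import Counter

# Hypothetical category names; the article names the kinds of callers,
# not the labels actually used.
UNWANTED = {"spam", "scam", "debt_collector", "prank", "silent_call"}


def aggregate_labels(labels, min_votes=3, min_share=0.6):
    """Return the majority category, or None if there are too few
    votes or the annotators disagree too much (assumed thresholds)."""
    if len(labels) < min_votes:
        return None
    category, votes = Counter(labels).most_common(1)[0]
    return category if votes / len(labels) >= min_share else None


def is_unwanted(category):
    """Couriers and ordinary organizations are not treated as unwanted."""
    return category in UNWANTED
```

    Returning `None` on disagreement keeps a number without a verdict instead of guessing, which fits the point that not every category can safely be classed as unwanted.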

    The Directory data and the first user reviews formed the basis of the Yandex caller ID, which launched last year in the web version of Search. Yandex began to respond with verdicts to many queries containing phone numbers.



    Soon, an early version of the caller ID was built into the Yandex.Maps application. It relied only on the Directory, since there were still too few reviews of other categories for it to work well. This led us to the next stage in the caller ID's development: reviews had to be collected on the mobile device, right after calls from unknown numbers, rather than waiting for them on the web. But how? Our first internal attempts to ask for feedback after every call ran into problems. Requests that come too often annoy users. Moreover, if any user can leave a review of any incoming call, this invites review manipulation and makes it easier. We had to act smarter.

    Yandex specializes in machine learning. With it, Search builds SERPs, the Browser detects malicious sites, and Music recommends tracks. Machine learning lets us find non-obvious patterns across a large number of heterogeneous factors. So we applied it in the new version of the caller ID, which now works in the Yandex application for Android. Our technology, based on the CatBoost library, analyzes more than two hundred factors when deciding whether to request a review, for example the frequency and duration of calls. For obvious reasons we will keep quiet about the other factors, but this solution allowed us to make the requests less intrusive and to make review manipulation as hard as possible.

    A few words about how it works now. If a user of the Yandex application has enabled the caller ID in the settings, then on a call from an unknown number a request is sent to our cloud, which returns the verdict.
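
    The lookup described above could look roughly like this on the server side. It is a minimal sketch under stated assumptions: the two dictionaries stand in for the Directory and the aggregated reviews, the phone numbers and verdict texts are made up, and checking the Directory before the reviews is our guess rather than a documented order.

```python
import re

# Hypothetical in-memory stand-ins for the Directory and for verdicts
# derived from user reviews; all numbers and texts are invented.
DIRECTORY = {"+74951234567": "Example Pizza Delivery"}
REVIEW_VERDICTS = {"+74957654321": "Possibly unwanted: advertising"}


def normalize(number):
    """Strip everything but digits and prefix '+'. For 11-digit numbers
    a leading 8 or 7 is unified to 7 (Russian dialing convention)."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits[0] in "78":
        digits = "7" + digits[1:]
    return "+" + digits


def lookup_verdict(number, contacts=()):
    """Return a verdict string, or None if nothing is known.
    Numbers already in the user's contacts are not looked up at all."""
    normalized = normalize(number)
    if normalized in {normalize(c) for c in contacts}:
        return None
    if normalized in DIRECTORY:
        return DIRECTORY[normalized]
    return REVIEW_VERDICTS.get(normalized)
```

    For example, `lookup_verdict("8 (495) 123-45-67")` finds the Directory entry because the number is normalized before the lookup.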



    By the way, the verdict is also shown for missed calls. This is convenient when you are not sure whether to call back.

    If Yandex does not know exactly who is calling, then once the call ends the user may see a request to leave a review. The likelihood of this request appearing depends on the cloud's analysis of all the factors.
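
    The decision to show a review request can be pictured with a toy example. The real system uses a CatBoost model over more than two hundred undisclosed factors. The hand-written score below is only a stand-in that uses the two factors the article names, and its weights, the direction of each effect and the daily prompt limit are all assumptions.

```python
import math


def review_request_probability(calls_per_day, duration_sec,
                               prompts_today, max_prompts_per_day=1):
    """Hypothetical stand-in for the model's output in [0, 1].
    Assumes that numbers calling often and briefly are more worth asking about."""
    if prompts_today >= max_prompts_per_day:
        return 0.0  # assumed throttle so requests do not annoy the user
    score = 0.8 * math.log1p(calls_per_day) - 0.02 * duration_sec
    return 1.0 / (1.0 + math.exp(-score))


def should_request_review(verdict, probability, draw):
    """Ask only when no verdict is known; `draw` is a uniform random
    number in [0, 1) supplied by the caller, e.g. random.random()."""
    return verdict is None and draw < probability
```

    Passing `draw` in explicitly keeps the function deterministic, so the same inputs always give the same decision in tests.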



    We are now collecting new reviews, which will inevitably shape the future development of the caller ID technology. If you have experience building such systems, or see an alternative solution to the problem of telephone spam and other unwanted calls, we would be glad to discuss it. Thanks.
