Tracking them down by IP: how to deal with spam in a social network

    Spam on social networks and in instant messengers is a pain, both for honest users and for developers. Mikhail Ovchinnikov spoke at HighLoad++ about how Badoo fights it; below is the text version of that talk.


    About the speaker: Mikhail Ovchinnikov works at Badoo and has spent the last five years on antispam.

    Badoo has 390 million registered users (as of October 2017). To compare the size of this audience with the population of Russia: in Russia, statistically, every 100 million people are protected by about 500 thousand police officers, while at Badoo every 100 million users are protected from spam by just one antispam employee. But even such a small number of programmers can protect users from all sorts of trouble on the Internet.

    We have a large audience, and users can be very different:

    • Good and very good ones, our beloved paying customers;
    • Bad ones, those who, on the contrary, try to make money off us: they send spam, lure money out of people, and commit fraud.

    Who we have to fight


    Spam comes in many forms, and often it cannot be distinguished from the behavior of an ordinary user at all. It can be manual or automated: we also want to catch bots that do automated mailouts.

    Perhaps you once wrote bots yourself, building scripts for automated posting. If you are doing that now, better stop reading: you must not find out what I am about to tell.

    That is, of course, a joke. This article contains no information that would make spammers' lives easier.



    So who do we have to fight? Spammers and scammers.

    Spam appeared long ago, at the very dawn of the Internet. On our service, spammers typically register an account with an uploaded photo of an attractive girl. In the simplest case, they immediately start sending the most obvious kinds of spam links.

    A more sophisticated variant is when people do not send anything explicit, no links, no ads, but lure the user to a place more convenient for them, for example to instant messengers: Skype, Viber, WhatsApp. There, out of our control, they can sell the user anything, promote whatever they want, and so on.

    But spammers are not the biggest problem. They act in obvious ways and are easy to fight. Far more complex and interesting characters are scammers, who impersonate other people and try to deceive users by every means available on the Internet.

    Of course, the actions of both spammers and scammers often differ little from the behavior of ordinary users, who sometimes do the same things. Both groups share many formal traits with ordinary users, and these traits do not allow a clear-cut distinction; drawing one is almost never possible.

    How to fight spam in the Mesozoic era


    1. The simplest thing you could do was write a separate regular expression for each type of spam and add every bad word and every single domain to that regex. All of this was done by hand and, of course, was utterly inconvenient and inefficient.
    2. You could manually find dubious IP addresses and add them to the server config so that suspicious users would never visit your resource again. This is inefficient because IP addresses are constantly reassigned and redistributed.
    3. You could write one-off scripts for each type of spammer or bot, run them over the logs, and look for patterns by hand. As soon as the spammer's behavior changes even slightly, everything stops working. Also completely ineffective.


    First, I will show the simplest methods of fighting spam, which anyone can implement. Then I will describe in detail the more complex systems we built using machine learning and other heavy artillery.

    The simplest ways to deal with spam


    Manual Moderation


    On any service you can hire moderators who will manually review a user's content and profile and decide what to do with that user. Usually the process looks like searching for a needle in a haystack: we have a huge number of users and far fewer moderators.



    Besides the fact that you obviously need a lot of moderators, they also need a large infrastructure. But the hardest part is the opposite problem: how to protect users from the moderators.

    Moderators must not get access to personal data. This matters because, in theory, moderators can also try to do harm. In other words, we need antispam for the antispam, so that the moderators themselves are kept under tight control.

    Obviously, you will not be able to check all users this way. But moderation is necessary in any case, because any system needs ongoing training and a human hand to decide what to do with a user.

    Statistics collection


    You can try to use statistics: collect various parameters for each user.



    The user Innocent logs in from his IP address. The first thing we do is record which IP address he came from. Then we build a direct and a reverse index between all IP addresses and all users, so that we can get all the IP addresses a particular user has logged in from, as well as all the users who have logged in from a particular IP address.

    This way we get a relationship between an attribute and a user. There can be quite a lot of such attributes. We can start collecting information not only about IP addresses but also about photos, the devices a user comes from, everything we are able to determine.
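    As a rough sketch of this bookkeeping (in-memory structures for illustration only; in production this would of course live in a database):

        from collections import defaultdict

        # direct index: user -> attribute values; reverse index: value -> users
        user_to_ips = defaultdict(set)
        ip_to_users = defaultdict(set)

        def on_login(user_id, ip):
            # maintain both directions of the index on every login
            user_to_ips[user_id].add(ip)
            ip_to_users[ip].add(user_id)

        def ips_of(user_id):
            return user_to_ips[user_id]   # all IPs this user logged in from

        def users_of(ip):
            return ip_to_users[ip]        # all users seen on this IP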



    We collect such statistics and associate them with the user. For each attribute we can keep detailed counters.

    We have manual moderation that decides which users are good and which are bad, and at some point a user is either blocked or recognized as normal. So for each attribute we can separately obtain the data: how many users it has, how many of them are blocked, and how many are recognized as normal.

    With such statistics for each attribute, we can roughly determine who is a spammer and who is not.



    Say we have two IP addresses: one with 80% spammers, the other with 1%. Obviously the first one is far more spammy; something needs to be done about it and some sanctions applied.

    The simplest approach is to write heuristic rules. For example, if more than 80% of the users are blocked and fewer than 5% are recognized as normal, the IP address is considered bad. Then we do something with all the other users on that IP address.
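    A minimal sketch of such a rule, using the thresholds from the example above (the counter layout is hypothetical):

        def ip_is_bad(stats):
            """stats: per-IP counters {'total': ..., 'banned': ..., 'normal': ...}."""
            if stats['total'] == 0:
                return False
            banned_share = stats['banned'] / stats['total']
            normal_share = stats['normal'] / stats['total']
            # the heuristic from the text: >80% banned, <5% recognized as normal
            return banned_share > 0.8 and normal_share < 0.05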

    Collecting statistics from texts


    Besides the obvious user attributes, you can also analyze texts. You can automatically parse user messages and extract everything spam-related: mentions of messengers, phone numbers, emails, links, domains, etc., and collect exactly the same statistics on them.
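    A sketch of such extraction with regular expressions (the patterns are simplified illustrations, not the production ones):

        import re

        PATTERNS = {
            'email':     re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'),
            'domain':    re.compile(r'\b[\w-]+\.(?:com|ru|net|org)\b'),
            'phone':     re.compile(r'\+?\d[\d\s()-]{7,}\d'),
            'messenger': re.compile(r'\b(?:skype|viber|whatsapp)\b', re.I),
        }

        def extract_attributes(message):
            # e.g. {'domain': ['example.com'], 'messenger': ['skype']}
            found = {}
            for name, rx in PATTERNS.items():
                matches = rx.findall(message)
                if matches:
                    found[name] = matches
            return found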



    For example, if a domain name was sent in messages by 100 users and 50 of them were blocked, then this domain name is bad and can be blacklisted.

    Based on text messages we get a large number of additional statistics for each user. No machine learning is needed for any of this.

    Stop words


    Besides the obvious things like phone numbers and links, you can extract phrases or words from the text that are especially characteristic of spammers. Such a list of stop words can be maintained by hand.

    For example, the phrase "There are a lot of fakes here" often appears on the accounts of spammers and scammers. They write that they are basically the only ones here looking for something serious, and everyone else is a fake who cannot be trusted.

    Statistically, on dating sites spammers use the phrase "I'm looking for a serious relationship" more often than ordinary people do. An ordinary person is unlikely to write that on a dating site; with about 70% probability it is a spammer trying to lure someone.

    Search for similar accounts


    With statistics on attributes and on the stop words found in texts, you can build a system that searches for similar accounts. We need to find and ban all accounts created by the same person, because a spammer who has just been blocked can immediately register a new account.

    For example, the user Harold logs in, registers on the site, and brings fairly unique attributes of his own: an IP address, a photo, a stop word he used. Maybe he even signed up with a fake Facebook account.



    We can find all users similar to him, those matching him in one or several of these attributes. Once we know for sure that these users are connected, we use that same direct and reverse index: from the user we find his attributes, from the attributes all the users sharing them, and we rank them. If, say, the first Harold is blocked, then the rest are easy to "kill" with this system too.
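    With the direct and reverse index from earlier, the search is two hops: from the user to his attributes, and from each attribute back to users. A sketch (ranking by the number of shared attributes is an assumption):

        from collections import Counter

        def similar_accounts(user_id, user_to_attrs, attr_to_users):
            """Rank other users by how many attributes they share with user_id."""
            shared = Counter()
            for attr in user_to_attrs[user_id]:    # direct index
                for other in attr_to_users[attr]:  # reverse index
                    if other != user_id:
                        shared[other] += 1
            return shared.most_common()            # most similar first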

    All the methods I have described are very simple: collecting the statistics is easy, and searching by these attributes is easy. But despite this simplicity, such simple things (plain moderation, plain statistics, plain stop words) can defeat 50% of spam.

    In our company, the Antispam department defeated 50% of spam in its first six months of work. The remaining 50%, as you might guess, is much harder.

    How to make life difficult for spammers


    Spammers keep inventing things to make life difficult for us, and we try to deal with them. It is an endless war. There are far more of them than of us, and for every move we make they come up with a multi-step scheme of their own.

    I am sure there are spammer conferences somewhere, where speakers tell how they defeated Badoo's Antispam, talk about their KPIs, or explain how to build scalable, fault-tolerant spam using the latest technologies.

    Unfortunately, we are not invited to such conferences.

    But we can make spammers' lives difficult too. For example, instead of showing the user a "You are blocked" window outright, you can use so-called stealth banning: we do not tell the user that he is banned. He should not even suspect it.



    The user ends up in a sandbox (Silent Hill), where everything seems real: he can send messages and vote, but in reality it all goes into the void, into the fog. Nobody will ever see or hear him; nobody will receive his messages or votes.

    We once had a case where a spammer spammed for a long time, promoting his shady goods and services, and six months later decided to use the service as intended. He registered a real account: real photos, real name, and so on. Naturally, our similar-account search quickly identified him and put him under a stealth ban. After that, for six months, he wrote into the void about how lonely he was, and no one answered. He poured his whole soul into the fog of Silent Hill and got no response.

    Spammers, of course, are no fools. They try to detect that they have landed in the sandbox and been blocked, so they can abandon the old account and start a new one. Sometimes we even fantasize that it would be nice to send several such spammers into the sandbox together, so they could sell each other whatever they wanted and amuse themselves as they pleased. We have not gotten there yet; instead we invent other methods, such as photo and phone verification.



    As you can imagine, it is hard for a spammer that is a bot rather than a person to pass phone or photo verification.

    In our case, photo verification looks like this: the user is asked to take a picture making a certain gesture, and the resulting photo is compared with the photos already uploaded to the profile. If the faces match, then most likely this is a real person who uploaded his own real photos, and he can be left alone for a while.

    Passing this test is not easy for spammers. We even had a little game inside the company called "Guess the spammer": given four photos, you have to figure out which one is a spammer.



    At first glance these girls look completely harmless, but as soon as they start going through photo verification, at some point it becomes clear that one of them is absolutely not who she claims to be.

    In any case, spammers struggle with photo verification. They genuinely suffer, trying to get around it somehow, to cheat, demonstrating all their Photoshop skills.



    Spammers do everything they can, and sometimes they probably think that all of this is handled by some incredibly modern technology that is so poorly built that it is this easy to fool.

    They do not know that each photo is then manually checked by moderators.

    No time!


    In fact, even though we keep coming up with ways to make spammers' lives difficult, there is usually not enough time: antispam has to work instantly. It must find and neutralize a bad user before he starts his harmful activity.

    The best thing is to determine that a user is no good right at the registration stage. This can be done, for example, using clustering.

    User clustering


    We can collect all available information right after registration. At that point we have no devices the user logs in from, no photos, no statistics. There is nothing to send him to verification for; he has not done anything suspicious yet. But we already have the primary information:

    • gender;
    • age;
    • country of registration;
    • country and provider determined from the IP;
    • email domain;
    • telephone operator (if any);
    • data from Facebook (if any): how many friends he has, how many photos he has uploaded, how long he has been registered there, etc.

    All this information can be used to find clusters of users. We use the simple and popular K-means clustering algorithm. It is well implemented everywhere, supported in every machine learning library, parallelizes nicely, and runs fast. There are streaming variants of the algorithm that assign users to clusters on the fly. Even at our volumes it all works quickly enough.
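    A minimal sketch with scikit-learn (the toy features and the one-hot encoding via DictVectorizer are assumptions; the real pipeline is more involved):

        from sklearn.cluster import KMeans
        from sklearn.feature_extraction import DictVectorizer

        # registration-time attributes, known before the user does anything
        registrations = [
            {'gender': 'f', 'age': 25, 'country': 'US', 'email_domain': 'mail.ru'},
            {'gender': 'f', 'age': 25, 'country': 'US', 'email_domain': 'mail.ru'},
            {'gender': 'm', 'age': 40, 'country': 'DE', 'email_domain': 'gmail.com'},
        ]

        X = DictVectorizer().fit_transform(registrations)  # one-hot + numeric
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
        # near-identical registrations land in one cluster: a possible mass signup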

    Having obtained such groups (clusters) of users, we can take various actions. If the users are very similar (the cluster is tightly knit), then most likely this is a mass registration and it should be stopped immediately. The user has not had time to do anything yet, he has only clicked the "Register" button, and that is it: he is already in the sandbox.

    You can collect statistics per cluster: if 50% of a cluster is blocked, the remaining 50% can be sent to verification. Or you can manually moderate whole clusters, look at the attributes they match on, and decide. From such data, analysts can extract patterns.

    Patterns


    Patterns are sets of the simplest user attributes that are known to us immediately. Some patterns actually work very effectively against particular types of spammers.

    For example, take a combination of three completely independent, fairly generic attributes:

    1. The user is registered in the USA;
    2. His provider is Privax LTD (a VPN operator);
    3. His email domain is one of [mail.ru, list.ru, bk.ru, inbox.ru].

    Each of these three attributes seems to mean nothing on its own, but together they give almost a 90% probability that the user is a spammer.
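    Such a pattern is essentially a conjunction of attribute predicates. A tiny sketch (the field names are hypothetical):

        RU_MAIL_DOMAINS = {'mail.ru', 'list.ru', 'bk.ru', 'inbox.ru'}

        def matches_us_vpn_pattern(user):
            # registered in the USA, via the Privax VPN, with a Russian mail domain
            return (user['country'] == 'US'
                    and user['isp'] == 'Privax LTD'
                    and user['email_domain'] in RU_MAIL_DOMAINS)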

    Any number of such patterns can be extracted for each type of spammer. This is much more efficient and easier than manually reviewing all accounts or even clusters.

    Text clustering


    Besides clustering users by attributes, you can look for users who write identical texts. Of course, this is not so easy. Our service operates in a great many languages; on top of that, users often write in slang, with abbreviations, and sometimes with mistakes. The messages themselves are usually very short, literally 3-4 words (about 25 characters).

    So if we want to find similar texts among the billions of messages users write, we have to come up with something unusual. Trying classical methods based on morphological analysis and proper, honest language processing is very hard under all these constraints: slang, abbreviations, and a pile of languages.

    You can do something simpler: use n-grams. Every incoming message is broken into n-grams. With n = 2 these are bigrams, pairs of letters. The entire message is split into pairs of letters, and we count how many times each bigram occurs in the text.



    You do not have to stop at bigrams: you can add trigrams and skip-grams (statistics over letters that are 1, 2, etc. positions apart). The more information we get, the better. But even bigrams alone already work quite well.

    Next, from the bigrams of each message we build a vector whose length equals the square of the alphabet size.

    This vector is very convenient to work with and to cluster, because it:

    • consists of numbers;
    • is dense, with no gaps;
    • always has a fixed size.

    The K-means algorithm runs very fast on such compact fixed-size vectors. Our billions of messages are clustered in just a few minutes.
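    As a sketch of how such a bigram vector might be built (the 26-letter alphabet and the normalization are assumptions; the real system covers many alphabets):

        import numpy as np

        ALPHABET = 'abcdefghijklmnopqrstuvwxyz'
        INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

        def bigram_vector(text):
            """Fixed-size vector of bigram frequencies, length = len(alphabet) ** 2."""
            v = np.zeros(len(ALPHABET) ** 2)
            chars = [c for c in text.lower() if c in INDEX]
            for a, b in zip(chars, chars[1:]):
                v[INDEX[a] * len(ALPHABET) + INDEX[b]] += 1
            return v / max(len(chars) - 1, 1)   # frequencies, not raw counts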

    But that is not all. Unfortunately, if we simply collect all messages with similar bigram frequencies, what we get is exactly that: messages with similar bigram frequencies. They need not be similar in meaning at all. Often there are long texts whose vectors are very close, almost identical, while the texts themselves are completely different. Moreover, beyond a certain text length this clustering method stops working altogether, because the bigram frequencies even out.


    Therefore we need to add filtering. Since the clusters already exist and are fairly small, we can easily filter within a cluster using stemming or bag-of-words. Inside a small cluster you can literally compare every message with every other and end up with a cluster that is guaranteed to contain identical messages, matching not just statistically but in fact.
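    Inside a small cluster an all-pairs comparison is affordable. A possible sketch of such filtering over bags of words (the Jaccard measure and the 0.8 threshold are assumptions, not necessarily what the production system uses):

        def word_set(message):
            return set(message.lower().split())

        def really_similar(a, b, threshold=0.8):
            """Jaccard similarity of the two messages' bags of words."""
            wa, wb = word_set(a), word_set(b)
            if not wa or not wb:
                return False
            return len(wa & wb) / len(wa | wb) >= threshold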

    So, we have clustering. Still, it is very important for us (and for the clustering) to know the truth about the user. If he tries to hide the truth from us, we have to take action.

    Information hiding


    The typical ways of hiding information are VPNs, Tor, proxies, and anonymizers. The user employs them to pretend he is from America when in fact he is from Nigeria.

    To defeat this problem, we took the most famous textbook, "How to track someone down by IP."



    With the help of this textbook, we wrote a VPN classifier: a classifier that takes an IP address as input and says whether that IP address is a VPN or proxy, or not.

    To implement the classifier, we need several ingredients:

    1. An ISP (Internet Service Provider) database, i.e., a mapping of IP addresses to all existing providers. Such a database can be purchased and is not very expensive.
    2. Information from whois. A lot of varied information is available for an IP address via whois: the country; the provider; the subnet the IP address belongs to; sometimes the fact that the IP address belongs to a hosting company, etc. You can scan the text for specific words and see what the IP address is about.
    3. A GeoIP database. If the database says an IP address is in Norway, while all the users on that Norwegian IP are scattered across Africa, then something is probably wrong with that IP address.
    4. User statistics: how many users of our service are blocked on this IP address, and how many of them match or contradict the GeoIP and whois data.

    Taking all this information together, you can build a classifier that will ultimately say whether an IP address is a VPN or not.



    We chose decision trees because they are very good at finding exactly those patterns, specific combinations of providers, countries, statistics, etc., that ultimately make it possible to determine that an IP address is a VPN.
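    A minimal sketch of such a classifier with scikit-learn (the feature encoding and the toy training data are purely illustrative; the real features are the four ingredients listed above, with labels from manually verified IPs):

        from sklearn.tree import DecisionTreeClassifier

        # per-IP features: [is_hosting_by_whois, geoip_mismatch_share,
        #                   banned_user_share, known_vpn_isp]
        X = [[1, 0.9, 0.8, 1],
             [0, 0.1, 0.0, 0],
             [1, 0.7, 0.6, 1]]
        y = [1, 0, 1]                     # 1 = VPN/proxy, 0 = regular IP

        clf = DecisionTreeClassifier(max_depth=5).fit(X, y)
        is_vpn = clf.predict([[1, 0.8, 0.5, 1]])[0]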

    Of course, this data is very generic. No matter how well we train the classifier and no matter how hard we push advanced technology, it will never be 100% accurate. That is why additional network checks are key here.

    Once we have information that an IP address supposedly belongs to a VPN, we can actually check what that IP address is. We can try connecting to it and see which ports are open. If there is a SOCKS proxy, we can try opening a connection through it and determine precisely whether this IP address is an anonymizer or not.
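    The simplest active check is to probe well-known proxy ports. A hedged sketch (the port list and timeout are assumptions):

        import socket

        PROXY_PORTS = [1080, 3128, 8080]     # SOCKS and common HTTP proxies

        def open_proxy_ports(ip, timeout=2.0):
            """Return which typical proxy ports accept a TCP connection."""
            open_ports = []
            for port in PROXY_PORTS:
                try:
                    with socket.create_connection((ip, port), timeout=timeout):
                        open_ports.append(port)
                except OSError:
                    pass                     # closed, filtered, or timed out
            return open_ports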

    There is also a great piece of technology we plan to introduce, called p0f. It is a utility that fingerprints traffic at the network level and lets you determine right away what sits on the other side of the connection: a normal user client, a VPN client, a proxy, etc. The utility contains a large set of signatures that identify all of this.

    Most suspicious action


    After we had written the various systems, clusters, and classifiers and collected the statistics, we asked ourselves: what is the most suspicious thing a user can do on our service? Registering is already suspicious! As soon as a user registers, we start eyeing him with great suspicion and analyzing him in every possible way, trying to figure out what he is up to.

    We often feel an inner urge: why not simply ban everyone right after registration? That would make the Antispam department's job much easier. We could spend twice as long drinking tea, and we would have no problems at all.

    To suppress such thoughts, not only in ourselves but also in the systems we write, and to avoid banning good users, especially right after registration, we have to build systems that fight our other systems. In other words, we impose restrictions on ourselves.



    How can we restrain ourselves from banning good users, from making mistakes, from getting confused?

    "User Decency"


    We classify users by honesty: we build an isolated model that takes all the positive characteristics of a user and analyzes them.

    Examples of "good" behavior characteristics:

    • dialogue length;
    • account age;
    • absence of complaints;
    • passed verifications;
    • purchases.

    On the dataset of these individual, exclusively good user characteristics we can train a classifier or a regression. We use a simple logistic regression. Such a model is easily portable, since it is just a set of coefficients; it is easy to implement in any system and any language, and it moves well between platforms and infrastructures.
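    Since the model is just a set of coefficients, applying it takes one dot product and a sigmoid. A sketch (the feature names and weights are illustrative, not the real ones):

        import math

        # coefficients exported from the trained logistic regression (illustrative)
        WEIGHTS = {'dialog_length': 0.8, 'account_age_days': 0.3,
                   'no_complaints': 1.2, 'verified': 1.5, 'purchases': 2.0}
        BIAS = -3.0

        def honesty(features):
            """Honesty coefficient in [0, 1] for a dict of user features."""
            z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
            return 1.0 / (1.0 + math.exp(-z))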

    Running a user through this model, we get a coefficient we call the "honesty coefficient." If it is zero, that usually means we have almost no information about the user, and the classification gives us nothing extra.

    If a user's honesty coefficient is 1, then most likely the user is a good guy and we will not touch him: neither verification nor a ban will come his way.



    Such an isolated component lets us avoid many common mistakes.

    False positives


    The second thing to do is look for various false positives. It happens that users log in from the same IP address by accident. For example, two people sit in the same Internet cafe; even the computer may be the same. The browser and the fingerprint we compute from the computer, the browser, and the device will be absolutely identical, and we might conclude that both users are spammers, although it is not at all certain they are connected.

    Another example: a good user, in a dialogue with a spammer, may reply to an ad by asking, "Hey, I didn't understand, what is Pornhub and why are you advertising it to me?" A naive system might then decide that this user is a spammer too and must be banned as soon as possible.

    So we have to search for anomalies. We take users and their attributes and look for those users who ended up in bad company entirely by accident.

    For example, take the stop word "Pornhub." For each stop word we have statistics on all users who have ever used it.



    At some point a new user, Patrick, uses the same stop word, and we are about to add him to this bad company and ban him.

    Here we need to check whether the new user Patrick differs from all the old, already known spammers. We can compare his typical attributes: gender, age, provider, application, country, etc. What matters is how large the "distance" in this attribute space is between the user and the main group. If it is very large, then Patrick most likely got there by accident. He meant no harm and should not be banned outright; better to send him to a manual check.
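    One possible sketch of such a distance check, measuring how far the newcomer sits from the centroid of the known spammers in one-hot attribute space (the z-score form and the threshold are assumptions):

        import numpy as np

        def is_outlier(new_user_vec, spammer_vecs, threshold=2.0):
            """True if the new user is far from the group of known spammers.

            Vectors are one-hot encoded attribute vectors; the distance is
            measured in standard deviations from the group's centroid.
            """
            group = np.asarray(spammer_vecs)
            centroid = group.mean(axis=0)
            dists = np.linalg.norm(group - centroid, axis=1)
            scale = dists.std() or 1.0
            d = np.linalg.norm(np.asarray(new_user_vec) - centroid)
            return (d - dists.mean()) / scale > threshold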

    Once we built such a system, typical false positives became much rarer.

    Universal Mega Classifier


    You may ask: why not build one great big system right away, with machine learning, neural networks, and decision trees, which takes all the information about users as input and outputs simply 0 or 1, spammer or not?


    Trying to build one universal model, it is very easy to end up facing a black box that is hard to control. In it, the good is not separated from the bad, the system is not isolated from itself, and the only protection against errors is manual testing and indirect metrics. Besides, it is quite hard to collect all the information and statistics over a large volume of data just to feed that mega-system its input.

    Moreover, every well-known machine learning system is not one model but a dozen. Any voice assistant or face recognition system is several models combined into one very complex whole.

    As a result, it became clear to us that a much more correct approach (from our point of view) is to create separate classifiers and clustering systems, each solving its own problem. Ideally, as in our case, a separate model is built for each type of spam and is separately controlled in various ways: by other models, by indirect metrics, and also manually. Only then can we hope to avoid most false positives.

    Come to HighLoad++ 2018: this year there will be many talks on machine learning and artificial intelligence, for example:

    • Sergey Vinogradov will present a standardized life cycle for ML models and show how to run them in production without incidents.
    • Dmitry Korobchenko from NVIDIA will talk about the algorithmic tricks used under the hood of real production neural networks.
    • Artem Kondyukov will present an interesting use case of machine learning in the pharmaceutical industry.

    We collect videos of past talks on our YouTube channel and announce upcoming topics in the newsletter; subscribe if you want to keep up with everything happening in the world of high load.
