An experiment at Yandex: how to identify an attacker using machine learning

    Yandex servers store a great deal of information that matters to people, so we have to protect our users' data reliably. In this article we describe our research into telling an account owner apart from an attacker, even when both know the account's username and password. We have developed a method based on analyzing users' behavioral characteristics. It uses machine learning and distinguishes the behavior of the real account owner from that of an attacker by a number of features.

    This analysis relies on mathematical statistics and on data about how Yandex services are used. Behavioral characteristics are not enough to identify a user uniquely and so cannot replace the password, but they do make it possible to detect a hack after authorization. A stolen mail password alone will therefore not be enough for an attacker to pass as the real owner. This is an important step: it offers a different view of Internet security systems and helps with hard problems such as determining who the real account owner is, and when and how the account was hacked.

    It is commonly believed that methods of recognizing a person appeared relatively recently, but in fact the history of identification methods goes back to the Middle Ages. Fingerprints are known to have been in use in China at the turn of the 14th and 15th centuries. The use was limited, though: merchants signed trade agreements this way. At the end of the 19th century, the uniqueness of papillary lines became the basis of fingerprinting, one of whose founders was William Herschel. It was he who put forward the theory that the pattern on the palmar surfaces of a person's hands does not change throughout life.

    Herschel Fingerprint Card

    With the development of information technology, various user recognition systems have appeared. Most of them are designed to control a person's access to some system, but in reality the field of user identification and authentication is much broader.

    Scientists all over the world are working on the problem of identifying people by various traits. There are different models and theories: from the most popular ones, which use the already mentioned fingerprints, the iris, or the voice, to new and controversial ones that take into account mouse movements, keyboard "handwriting", and behavior on websites. Yandex is also actively studying existing models and creating new ones. We are at the very beginning of the journey but have already had some success, so we want to tell you a little about our experiments.

    We are constantly working on algorithms that protect mailboxes from hacking, spam, and malicious activity that could harm the user. The access control methods that already exist make it harder for attackers to get into a mailbox but, alas, they do not completely solve the problem of hacking. The weak point remains the password, which can be lost, stolen, intercepted, or guessed. For example, a password can be intercepted if you reuse your Yandex password on other services that do not support a secure connection.

    We asked ourselves: "Is it possible to distinguish an attacker from the real account owner if both log in with the same password?" It turned out that the answer is yes. Our research has shown that the behavior of a mailbox owner always differs from the behavior of an attacker.

    In general, a number of characteristics can be extracted from a user's behavior in Mail: login time, usual location, number of authorizations, devices used, and so on. Some operations are not typical for a particular person, for example deleting read messages, erasing folders, or sending bulk mailings. A person may also handle different types of messages in a consistent way: reading messages from people, deleting mailings, ignoring notifications from social networks. In addition, there are habits such as "reading a chain of unread messages from the bottom up" or "logging in and going first to Mail, then to Disk, and then to News". Such behavior patterns can be computed for many of our services. Together, these factors form the user's profile, which does not give a complete picture of the person but does make it possible to tell an account hack from a normal authorization. Of course, this approach cannot be effective without machine learning, which is used to select the factors that shape the profile and to set the boundaries beyond which a session is considered a hack.
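
    To make the idea of "characteristics extracted from behavior" more concrete, here is a minimal sketch in Python. It is only an illustration, not the production system: the `Event` record, its field names, and the chosen features (`login_hour`, `delete_share`, and so on) are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    # Hypothetical log record; the field names are assumptions for illustration.
    timestamp: datetime
    country: str
    device: str
    action: str  # e.g. "login", "read", "delete", "send"


def session_features(events):
    """Summarise one session as a flat dict of behavioral features."""
    if not events:
        raise ValueError("a session must contain at least one event")
    actions = Counter(e.action for e in events)
    total = len(events)
    # The earliest event stands in for the login itself.
    first = min(events, key=lambda e: e.timestamp)
    return {
        "login_hour": first.timestamp.hour,
        "country": first.country,
        "device": first.device,
        "n_events": total,
        # Shares of actions that the text calls atypical for many users.
        "delete_share": actions["delete"] / total,
        "send_share": actions["send"] / total,
    }
```

    A real system would use many more factors; which of them actually carry signal is exactly what the machine learning step is meant to decide.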

    The essence of the method is very simple: everyone has habits of their own, from their schedule of work and rest to the places they spend time in and the number of devices they use. For example, one person always checks mail from home and from work, uses two devices, never deletes read messages, and does not send spam. This person uses mail during the day and never at night. Another person often travels for a month at a time and periodically reads mail from different countries. These users will have different behavior patterns, from which an individual profile can be built and then compared against every new login to the mailbox.
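
    Comparing a new login with an individual profile can be sketched as follows. This is a deliberately naive frequency-based score, assumed here for illustration only; the feature names and the scoring rule are not taken from the actual system, which relies on machine learning to choose factors and boundaries.

```python
from collections import Counter

# Assumed categorical features of a login; the names are illustrative.
FEATURES = ("login_hour", "country", "device")


class Profile:
    """Per-user counts of how often each feature value was seen."""

    def __init__(self, history):
        if not history:
            raise ValueError("history must contain at least one session")
        self.n = len(history)
        self.counts = {
            name: Counter(session[name] for session in history)
            for name in FEATURES
        }

    def anomaly_score(self, session):
        """Mean share of past sessions that disagree with this one.

        0.0 means every feature matches all past sessions,
        1.0 means no feature value has ever been seen for this user.
        """
        surprises = [
            1.0 - self.counts[name][session[name]] / self.n
            for name in FEATURES
        ]
        return sum(surprises) / len(surprises)
```

    With a history of daytime logins from one country and one laptop, a login at the usual hour from the usual device scores 0.0, while a night-time login from a new country and a new device scores 1.0. Where to draw the line between the two is a threshold that would have to be learned from data.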

    This is what the profiles of two different people look like. The red graph shows the profile of an ordinary user who has not been hacked: everything is fairly uniform, with no sharp jumps in the parameters. The blue graph illustrates the behavior of a suspicious account: all indicators jump sharply, and access to the service looks chaotic. This gives grounds to suspect unauthorized access.

    And this graph shows how a profile changes at the moment of a hack. In the blue region the indicators are normal, while the red zone already shows significant fluctuations. The dates on which this happened are also clearly visible, which can greatly simplify the investigation of a hack.
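
    Locating the date at which a profile starts to deviate can be illustrated with a simple threshold rule. The sketch below assumes that a per-day anomaly score is already available and that the first `baseline_days` days are known to be clean; both the function name and the "mean plus k standard deviations" rule are assumptions for this example, not a description of the method used in the experiment.

```python
from datetime import date, timedelta
from statistics import mean, pstdev


def first_anomalous_day(daily_scores, baseline_days=7, k=3.0):
    """Return the first date whose score exceeds the baseline limit.

    daily_scores: list of (date, score) pairs sorted by date.
    The first `baseline_days` entries are assumed to be normal behavior.
    Returns None if no later day exceeds the limit.
    """
    if len(daily_scores) <= baseline_days:
        raise ValueError("need more days than the baseline window")
    baseline = [score for _, score in daily_scores[:baseline_days]]
    # Guard against a perfectly flat baseline (zero deviation).
    limit = mean(baseline) + k * max(pstdev(baseline), 1e-9)
    for day, score in daily_scores[baseline_days:]:
        if score > limit:
            return day
    return None


# Usage: seven quiet days, then a jump on the tenth day.
start = date(2024, 1, 1)
scores = [0.10, 0.12, 0.09, 0.11, 0.10, 0.10, 0.11, 0.10, 0.12, 0.80, 0.90]
series = [(start + timedelta(days=i), s) for i, s in enumerate(scores)]
suspected = first_anomalous_day(series)
```

    Here `suspected` is the date of the first large jump, which corresponds to the boundary between the "normal" and "fluctuating" zones on a graph like the one above.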

    This approach can protect users against theft of passwords and session cookies, and it makes it possible to detect a hack even after someone has logged in to the account.

    We are not yet ready to announce a fully working hack detection system. Not all pieces of the puzzle are in place yet, and it will take time to evaluate these technologies properly and learn how to make the most of them. But their effectiveness is already evident: using machine learning in information protection systems can greatly increase the security of stored data. So we will continue to work in this direction.
