How to determine the mobile operator and home region by phone number


When we try to work out which operator a phone number belongs to, we usually look at its DEF code. For example, if the number starts with 916 it is MTS, 968 is Beeline, 926 is MegaFon (it all depends on your region). But this method is only a rough heuristic and is completely unsuitable when you need accurate data. In reality, things are more complicated: a single DEF code is often split among several operators, and the number does not have to belong to one of the Big Four operators at all. And finally, the number may simply have been ported.

In this article I will explain how to reliably identify the mobile operator that serves a given phone number, and how to get an additional, “free” piece of information: the subscriber’s home region. You can use this data however you like, from prefilling the address in a user profile and redirecting to the regional version of your service to using it in processing and statistics. At the end of the article there is a link to the source code on GitHub.

I should note right away that the subscriber’s home region is, by and large, unrelated to the user’s current location, i.e. the region we determine answers the question “Where is the number from?” rather than “Where is the user?”.

Data sources

Rossvyaz

We get our phone number when we conclude a service contract with a telecom operator. In turn, the distribution of number ranges between telecom operators, as well as standardization and general control of telecommunication services are carried out by the relevant state and international organizations. In Russia, such an organization is the Federal Communications Agency (Rossvyaz).

Thus, Rossvyaz is the most reliable source of information about who services a Russian phone number, and this is open data that the agency publishes on its website: www.rossvyaz.ru/opendata. A current list of mobile number ranges is published as a CSV file. Each line in the file looks like:

DEF code, range start, range end, operator name, region name
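
As a rough illustration, a file in this format could be read as shown below. This is only a sketch: the comma delimiter, the 7-digit zero-padded subscriber part, and the names `RangeRecord` and `parse_ranges` are my assumptions for the example, and the real file may differ in delimiter, encoding, and columns.

```python
import csv
import io
from typing import NamedTuple


class RangeRecord(NamedTuple):
    start: int      # first full number in the range, e.g. 79031000000
    end: int        # last full number in the range (inclusive)
    operator: str
    region: str


def parse_ranges(text: str, country_code: str = "7", delimiter: str = ",") -> list[RangeRecord]:
    records = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 5 or not row[0].strip().isdigit():
            continue  # skip blank lines and a header row, if any
        code, first, last, operator, region = (field.strip() for field in row[:5])
        # Assumption: the subscriber part is 7 digits, zero-padded on the left.
        start = int(country_code + code + first.zfill(7))
        end = int(country_code + code + last.zfill(7))
        records.append(RangeRecord(start, end, operator, region))
    return records


sample = "903,1000000,1999999,Beeline,Moscow\n"
print(parse_ranges(sample))
```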

However, since 2013 it has been possible to port a number from one operator to another. So, relying only on the Rossvyaz registries, you cannot say for certain that a number is served by a specific operator. You can say this about the region, though, because number portability works only within the home region: porting a number from MTS Novosibirsk to Tele2 St. Petersburg is impossible in principle.

Thus, if the task is to determine only the user's region, then Rossvyaz registries will be enough.

Ported Numbers Database

If you need to determine the operator accurately, you cannot do without the Ported Numbers Database (BDPN), which is operated by TsNIIS. The connection procedure is described on their website: zniis.ru. Unfortunately, as far as I know, connecting to it directly is not easy, and once you do get access, you may not share the database with anyone.

The structure of this database is extremely simple: it consists of three CSV files, each listing records in the format “number, operator name”:

All ported numbers as of the current day (updated once a day);
All numbers ported in the last hour (updated once per hour);
All numbers returned to their native operator in the last hour (updated once per hour).

At the time of writing, there are about 6 million records in the BDPN.
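
To make the role of the three files concrete, here is a minimal sketch of keeping an in-memory snapshot up to date. The function names, the sample numbers and operators, and the order in which the hourly files are applied are all my assumptions for illustration; the actual file layout and update rules are defined by the database operator.

```python
import csv
import io


def load_ported(text: str) -> dict[str, str]:
    """Read "number, operator name" rows into a dict."""
    ported = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2:
            ported[row[0].strip()] = row[1].strip()
    return ported


def apply_hourly(ported: dict[str, str], newly_ported: str, returned: str) -> None:
    """Apply the two hourly files to the snapshot, in place."""
    ported.update(load_ported(newly_ported))
    for number in load_ported(returned):
        ported.pop(number, None)  # the number is back with its native operator


# Made-up sample data: a daily snapshot followed by one hourly update.
snapshot = load_ported("79031001234,Tele2\n79161112233,Beeline\n")
apply_hourly(snapshot, "79260000001,MTS\n", "79161112233,MTS\n")
print(snapshot)
```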

To summarize: we have ranges of numbers that correspond to particular operators and regions (Rossvyaz), and a list of exception numbers from these ranges (BDPN), which overrides only the operator name.

How to identify subscribers

The most obvious solution to this problem is to take the word "range" literally. I.e., to identify a number, we sort all the operator records by their ranges and look for the record with the narrowest range that contains the number. The complexity of this algorithm is that of binary search, which is pretty good.

But there is a more original and universal implementation whose complexity is constant, regardless of the size of the data. It uses number masks.
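
A minimal sketch of this range-based lookup, using the standard `bisect` module, might look like this. It assumes the ranges do not overlap and are sorted by their start; the second row and the name `find_by_range` are illustrative only.

```python
from bisect import bisect_right

# (range start, range end inclusive, operator, region), sorted by range start.
# Illustrative rows, not real registry data.
RANGES = [
    (79031000000, 79031999999, "Beeline", "Moscow"),
    (79160000000, 79169999999, "MTS", "Moscow"),
]
STARTS = [r[0] for r in RANGES]


def find_by_range(number: int):
    # Index of the last range whose start is <= number.
    i = bisect_right(STARTS, number) - 1
    if i >= 0 and number <= RANGES[i][1]:
        return RANGES[i][2], RANGES[i][3]
    return None  # the number falls into a gap between ranges


print(find_by_range(79031001234))
```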

Number mask

A number mask is a string consisting of digits and a special single-character wildcard ("?"), which means that any digit can be in its place. Moreover, a question mark can only be followed by another question mark.

Thus, one of the Beeline ranges in Moscow, “79031000000 - 79031999999”, is written as the mask “79031??????”.

It is very convenient to work with such masks, for example, to set them manually in the configuration. In addition, representing ranges in the form of masks makes it possible to use more efficient storage methods and simple search algorithms.
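
Not every range maps to a single mask, so a range generally has to be split into several. The sketch below shows one greedy way to do this; it is my own illustration rather than the code from the repository, and it assumes both ends of the range have the same number of digits.

```python
def range_to_masks(start: int, end: int) -> list[str]:
    """Split an inclusive range into masks of digits followed by '?'."""
    width = len(str(end))
    masks = []
    while start <= end:
        k = 0  # number of trailing '?' in the next mask
        # Grow the block while it stays aligned to a power of ten
        # and still fits inside the range.
        while (k < width
               and start % 10 ** (k + 1) == 0
               and start + 10 ** (k + 1) - 1 <= end):
            k += 1
        digits = str(start).zfill(width)
        masks.append(digits[: width - k] + "?" * k)
        start += 10 ** k
    return masks


print(range_to_masks(79031000000, 79031999999))  # ['79031??????']
print(range_to_masks(7998, 8009))                # ['7998', '7999', '800?']
```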

Hash table

For example, one such algorithm stores the mask-to-operator correspondences in a hash table (or any other key-value store). All masks are added to the hash table as keys; the values are operator objects with regions.

The search operation is most clearly explained by example. Let's say we are looking up the number 7 (903) 100-1234, and we have the mask 79031?????? - Beeline, Moscow.

First, we look for a key that exactly matches the original number: 79031001234.
If it is not found, we replace the last digit of the number with "?" and look for the key 7903100123?.
If nothing is found again, we replace the next digit from the end with "?" and look for 790310012??, and so on.

In the end, we search for the key 79031?????? and find that the number belongs to Beeline, Moscow.

As you can see, the complexity of the algorithm equals that of several hash table lookups, each of which, if implemented correctly, usually takes constant time. The number of lookups is bounded by the length of the phone number, which, according to ITU-T Recommendation E.164, does not exceed 15 digits.

The same algorithm can be applied to ported numbers - you can simply add them to the same hash table.
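
The lookup described above can be sketched with an ordinary Python dict. The function name `lookup` and the ported entry in the sample table are made up for the example.

```python
def lookup(table: dict[str, tuple[str, str]], number: str):
    """Try the exact number, then progressively wider masks."""
    for kept in range(len(number), -1, -1):
        # Keep the first `kept` digits and replace the rest with '?'.
        key = number[:kept] + "?" * (len(number) - kept)
        if key in table:
            return table[key]
    return None


table = {
    "79031??????": ("Beeline", "Moscow"),  # mask built from a Rossvyaz range
    "79031001234": ("Tele2", "Moscow"),    # hypothetical ported number
}

print(lookup(table, "79031005555"))  # ('Beeline', 'Moscow')
print(lookup(table, "79031001234"))  # ('Tele2', 'Moscow')
```

Because the exact number is tried first, a ported number stored in the same table takes precedence over the mask that covers it.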

Prefix tree

A much more efficient method is to build a prefix tree from the masks, relying on the fact that phone numbers consist of digits. Each node of this tree can have up to 10 digit child nodes (0-9) and one wildcard node. A wildcard node can only have wildcard children. When a mask is added to the tree, each of its characters in turn becomes a node. Thus, we effectively represent all our masks in a single tree.

For example, a tree consisting of the masks:
7913? - Mno1
791?? - Mno3
7952 - Mno2
7953 - Mno3
795? - Mno1
is shown in the picture (the listed masks appear in the tree from left to right).

The search algorithm is probably already clear: we take each digit of the number in order and descend the tree starting from the root. At each step we first try the matching digit node; if there is none, we check whether there is a "?" node. If there is, we follow it, and at the end we check that the length of the mask matches the length of the number; if it does, the operator is found.
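
Here is a minimal sketch of such a tree, filled with the example masks from above. It is not the implementation from the repository; the class names are made up, and the search additionally backtracks to a "?" branch when a digit branch leads to a dead end, which is my own addition to the description.

```python
class Node:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}   # keys are "0".."9" or "?"
        self.value = None    # operator, set on the node where a mask ends


class MaskTree:
    def __init__(self):
        self.root = Node()

    def add(self, mask: str, value: str) -> None:
        node = self.root
        for ch in mask:
            node = node.children.setdefault(ch, Node())
        node.value = value

    def find(self, number: str):
        return self._find(self.root, number, 0)

    def _find(self, node, number, pos):
        if pos == len(number):
            return node.value  # None if no mask of this exact length ends here
        # Prefer the exact digit, then fall back to the wildcard branch.
        for key in (number[pos], "?"):
            child = node.children.get(key)
            if child is not None:
                found = self._find(child, number, pos + 1)
                if found is not None:
                    return found
        return None


tree = MaskTree()
for mask, mno in [("7913?", "Mno1"), ("791??", "Mno3"), ("7952", "Mno2"),
                  ("7953", "Mno3"), ("795?", "Mno1")]:
    tree.add(mask, mno)

print(tree.find("79135"))  # Mno1, via 7913?
print(tree.find("79125"))  # Mno3, via 791??
print(tree.find("7954"))   # Mno1, via 795?
```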

Conclusion

Depending on your constraints, you can combine these approaches and keep the ported numbers and the Rossvyaz masks in separate stores. For example, in terms of memory it is better to use the hash table for ported numbers, while for the Rossvyaz registries the mask tree is always the better choice. When searching, first look in the table, and if nothing is found there, look in the tree. Separate stores are also convenient for automatic updates: if the BDPN has changed (and it changes constantly), there is no need to re-read the Rossvyaz ranges.

For maximum performance, you can keep all the data in RAM. In my Java implementation, the Rossvyaz mask tree takes no more than 20-30 MB, and a hash table with ported numbers takes about 500-600 MB. If the ported numbers are stored in a prefix tree instead, it needs about 1.5 times more memory because the tree nodes are very sparse. On the other hand, this gives a fairly significant performance gain.

Thanks for your attention!

→ All source code is available on GitHub.
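
One possible way to wire the two stores together is sketched below. It is a variant under my own assumptions: the ported-number table holds only operator names, so the region always comes from the Rossvyaz data. The names `find_subscriber` and `by_prefix` and the sample data are hypothetical, and `by_prefix` merely stands in for a real tree or range lookup.

```python
from typing import Callable, Optional

Subscriber = tuple[str, str]  # (operator, region)


def find_subscriber(
    number: str,
    ported: dict[str, str],
    find_home: Callable[[str], Optional[Subscriber]],
) -> Optional[Subscriber]:
    ported_operator = ported.get(number)  # table of ported numbers first
    home = find_home(number)              # then the Rossvyaz ranges
    if home is None:
        return None  # the number is not in any known range
    return ported_operator or home[0], home[1]


# Stand-in data for illustration only.
home_ranges = {"79031": ("Beeline", "Moscow")}
ported = {"79031001234": "Tele2"}


def by_prefix(number: str) -> Optional[Subscriber]:
    return home_ranges.get(number[:5])


print(find_subscriber("79031001234", ported, by_prefix))  # ('Tele2', 'Moscow')
print(find_subscriber("79031005555", ported, by_prefix))  # ('Beeline', 'Moscow')
```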
