How we use mathematical statistics to measure data quality in Yandex.City

    Petka and Vasily Ivanovich are flying in a plane. Vasily Ivanovich shouts:
    - Petka, instruments!
    Petka replies:
    - Two hundred!
    Vasily Ivanovich:
    - What do you mean, "two hundred"?
    - What do you mean, "instruments"?

    Today our new service, Yandex.City, comes out of beta. It appeared as a logical continuation of Yandex.Directory, which has been the single source of knowledge about organizations for all our services. Its data is used in the Yandex.City application itself, on Yandex.Maps, in snippets on the search results page, for building routes in Maps and Navigator, for identifying phone numbers in Yandex.Kit, and for choosing pickup and drop-off points in Yandex.Taxi. You could already find places and organizations in many of our products, but choosing among them was not very convenient.

    We realized that users need a separate service for this. But understanding and doing are not the same thing. In this post I want to talk about how we chose metrics to measure what was happening, what unobvious discoveries awaited us along the way, and why it is in general not easy to evaluate data quality across Russia, or even across individual cities.

    Whether you run your own business or work as a hired manager, it is very important to be able to measure business performance. How do you know whether things are going well or badly? How do you verify that changes led to improvement? What will you base your decisions on? For all this you need metrics: quantitative characteristics of the state of the system.

    The service for finding places on Yandex has a long history, and several teams had a hand in creating it. It grew out of an earlier project; then Yandex integrated the Yellow Pages business into it, and that is how the Directory appeared. About a year ago the service team was largely renewed, and it began to turn into Yandex.City. I am in charge of data production on this team, and today I will tell you about our metrics and how they help us build the best database of organizations in Russia.

    How we chose metrics

    In fact, once upon a time we lived without them at all. The fact that we managed to develop an assessment system we could orient ourselves by was in itself a great achievement for the whole team. At the next stage, we began to reflect on whether our metrics were actually any good.

    One of our benchmarks was the number of organizations (POIs) known to the service. But upon closer inspection this metric turned out to be rather meaningless. It is certainly convenient and useful for operational management, because it is easy to compute and clear to everyone. But we are making a product for users, and their happiness was not reflected in it.

    Judge for yourself: is the user better off because we used to know 50K organizations and now know 60K? Maybe. But what if we still don't know the pharmacy nearest to his house, or the nearest ATM of the right bank?

    But if we have chosen the number of organizations as our metric, then before we can answer that user's question we need to clarify one important point. Should that ATM be considered an organization? What about a Rospechat kiosk? A public toilet? An automatic kiosk selling metro or train tickets?

    It seems that, by summarizing user needs, a product manager could answer these questions. But there are many products, and each has its own manager with its own data requirements. How do you merge them all into a single, consistent specification, given that no individual customer can know every detail about the database?

    Realizing that I should not wait for someone else to give me an answer, I wrote up a consolidated definition of an organization and put it in a public place. At the bottom I added: "Send reasoned suggestions for changing this document to the following list of people." This allowed us to start work in parallel with the endless discussions of how to live right.

    As discussed above, the metric "total number of known organizations" is not the best, as it does not help us understand how well we solve user problems. And that is our main goal.

    Let's see how a set of metrics could describe our project. Some fairly obvious candidates come to mind:

    • The aforementioned number of organizations, both in absolute terms and in comparison with our main competitors (including within individual categories).
    • The number or percentage of irrelevant search results caused by an organization missing from the database. A useful metric, but difficult to measure, especially at national scale.
    • Accuracy: the proportion of organizations with correct data among all organizations in the system.
    • Completeness: the proportion of organizations known to Yandex among all organizations in the real world.

    We chose the last two, as they best reflect the quality of the system at a measurement cost we could afford.

    How we measured metrics

    It seemed life was getting better: we defined the metrics, measured them regularly, and planned measures to improve them. Live and be happy. Inside the company, people were eager to discuss metrics in terms of "accuracy should be at least X" or "completeness should be at least Y." It seemed clear to everyone what accuracy is, and that it should, of course, be as high as possible. Nobody found it particularly interesting to discuss what counts as an organization, what counts as an error affecting accuracy, and so on.

    But when we analyzed how we actually measure accuracy, it turned out that different parties often understood it slightly differently. We could have saved many hours of lively meetings if it had become clear right away that we were speaking different languages and therefore could not agree. Our data is used by different Yandex services, and each service has its own accuracy requirements. Yet, remarkably, they all simply demand "data with an accuracy of at least X." The big surprise was that everyone understands accuracy very differently.

    As a result, we described several accuracy metrics, nested within one another in sequence.

    Thanks to this, we saw that while our accuracy according to the basic metric, the one used from the very beginning, is indeed high (above 90%), for the stricter metrics it can fall to 50-60%. And it is precisely the nesting of the metrics within each other that allows us to work consistently on database quality, moving from one metric to the next.

    Naturally, all such measurements are made on random samples of organizations, and this brings with it another insidious mistake. People often forget that any sampled metric has an error. That is, if six months ago the measured accuracy was 62% and now, after completing a project, it is 63%, this by itself means nothing; it is too early to beat the drums. Quoting statistics without indicating the error is, frankly, unprofessional.

    The second mistake when working with such metrics is using multi-point scales, for example, assigning a "rating" from 1 to 5 depending on the severity of the error. You can get an acceptable estimation error at a reasonable measurement cost only with a binary metric: the data on a given organization is either accurate or it is not.

    As mentioned above, metrics are measured on random samples. The samples are checked manually by people, and the final metric is estimated from this markup. The larger the sample, the higher the labor cost of checking (marking up) it. In an ideal world, which does not exist, labor costs nothing and checking an organization takes infinitesimally little time. There, in that ideal world, it would be logical to check the entire database and compute the exact accuracy; then, incidentally, we would not have to constantly correct for estimation error. Reality, however, makes its own adjustments, so we want to determine the optimal sample size. Mathematical statistics comes to the rescue.

    Suppose we have N = 40,000 organizations in our database. We are interested in the sample size n that must be marked up in order to draw a reliable conclusion about the accuracy of the entire database. Note that from a sampled markup we can only draw a conclusion of the following form:

    "The real accuracy of the database deviates from a certain number p by no more than δ with probability at least P."

    So, we have three values: n, the sample size; δ, the estimation error; and P, the confidence of the estimate. Of course, we want to minimize the cost of marking up n organizations, minimize the estimation error δ, and maximize the probability P with which we can claim that the results obtained from the sample apply to the entire database. But, as in the famous joke "Fast, good, cheap: pick any two," something has to be sacrificed. We will not sacrifice much of the probability P, otherwise the estimate would reflect the real state of affairs only with low probability, and why take the measurement at all then? So we settle on a confidence of P = 95% and a permissible error of δ = 5%. The minimum sample size is now easy to calculate, and, happily, it turns out that marking up a sample of only 384 organizations is enough!
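    The calculation above can be sketched in a few lines of Python. This is the standard worst-case sample-size formula for a binary proportion (the article does not show its code, so this is an illustration, not the team's actual implementation); p = 0.5 maximizes the variance of a yes/no outcome, so the result is safe for any true accuracy.

```python
from statistics import NormalDist

def sample_size(confidence: float, max_error: float, p: float = 0.5) -> int:
    """Minimum sample size so that the sample proportion deviates from the
    true proportion by at most max_error with the given confidence.
    p = 0.5 is the worst case (maximum variance of a binary metric)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z-score
    return round(z * z * p * (1 - p) / (max_error ** 2))

print(sample_size(0.95, 0.05))  # → 384
```

Note that the result does not depend on the database size N = 40,000 at all: for large populations, the required sample size is driven almost entirely by the desired error and confidence.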

    Suppose we marked up a sample of n = 384 organizations, and m = 290 of them turned out to be correct. Then the point estimate of accuracy is p̂ = m/n = 290/384 ≈ 75.5%, and the estimation error is δ = z·√(p̂(1 − p̂)/n) ≈ 4.3%. In total, we can say that our accuracy is 75.5 ± 4.3%. If a requirement arises to reduce the error, it is enough to mark up some additional random organizations, their number calculated from the new error requirement.
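    The estimate and its error can be computed with the normal approximation to the binomial proportion; a minimal sketch (again an illustration, not the team's code):

```python
from math import sqrt
from statistics import NormalDist

def accuracy_with_error(m: int, n: int, confidence: float = 0.95):
    """Point estimate of accuracy and its margin of error, using the
    normal approximation to the binomial proportion."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    p_hat = m / n                              # point estimate
    delta = z * sqrt(p_hat * (1 - p_hat) / n)  # margin of error
    return p_hat, delta

p, d = accuracy_with_error(290, 384)
print(f"{p:.1%} ± {d:.1%}")  # → 75.5% ± 4.3%
```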

    As already noted, having obtained, say, 72.2 ± 4.5% in the next measurement, we cannot say anything definite about whether our accuracy has increased or decreased: the two confidence intervals overlap.

    Now let us turn to the other defining metric of any directory: completeness. Completeness is the proportion of real-world organizations that are present in the directory. So, if the city of X has 10,000 organizations of interest to you, and you know only 6,000 of them, your completeness is 60%.

    This metric can be measured "honestly" only by visiting organizations in the real world, so in terms of labor it is quite an expensive metric. Fortunately, for the same reasons as above, a random sample of only 384 organizations is enough to measure completeness with an acceptable error.

    The measurement process is as follows. First, we create a random sample of addresses (buildings) from the address database we have in Maps. Then we visit each building and, following the chosen definition, describe all the organizations present there.

    A random sample of the collected organizations is then checked for presence in Yandex.City. The completeness estimate and its error are calculated using the formulas we already know.
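    Putting the two steps together, the final check is the same binary-proportion estimate as before, applied to the field-collected sample. A sketch with hypothetical data (the identifiers and matching logic here are invented for illustration; in practice, matching a field-collected organization to a directory record is itself a nontrivial task):

```python
from math import sqrt

def completeness_estimate(field_sample: list, directory: set,
                          z: float = 1.96) -> tuple:
    """Share of field-collected organizations found in the directory,
    with the 95% margin of error of that estimate."""
    n = len(field_sample)
    found = sum(1 for org in field_sample if org in directory)
    p_hat = found / n
    delta = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, delta

# Toy data: 3 of the 4 organizations collected on foot are in the directory.
p, d = completeness_estimate(["a", "b", "c", "d"], {"a", "b", "c", "x"})
print(f"{p:.0%} ± {d:.0%}")  # → 75% ± 42%
```

The huge error on the toy sample of 4 also illustrates the earlier point: only at a sample size around 384 does the margin shrink to the agreed 5%.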


    Beyond the technical matters, I would like to highlight the managerial lessons I learned while working on Yandex.City.

    • [Captain Obvious] Metrics are needed to manage any process or project. Whenever possible, they should be simple and understandable. There is nothing wrong with complex metrics, but in a large distributed team there must be a simple metric shared by all team members.
    • When evaluating any metric, and when comparing metrics at different points in time, it is important to understand the error of the estimate.
    • When evaluating metrics on random samples, use binary metrics: only this approach achieves acceptable estimation accuracy with a relatively small number of measurements.
    • If someone brings you a metric, do not be lazy: find out exactly how it is calculated. Ask for the source, the formula, a description of the calculation algorithm, and examples of the markup. Find out the sample sizes and how the samples were generated. It may come as a surprise that some offline metric, reported once a year, is actually computed on data from the last month, that is, without accounting for seasonality.

    We examined two metrics, accuracy and completeness, which on the one hand let us judge adequately how well we reflect real-world organizations in our system, and on the other are clear to the whole company and inexpensive to measure.

    Next time we will talk about how to measure the offline completeness metric in cases where there is no good map of the city, which for us means we have no address information and no complete list of buildings.
