How Yandex's crowdsourcing platform helps train self-driving cars and evaluate service quality

    Many jobs involve long, monotonous tasks that require a lot of people: transcribing a few hundred audio recordings, marking up thousands of images, or filtering an ever-growing stream of comments. You could hire dozens of full-time employees for this, but all of them would need to be found, selected, motivated, supervised, and given development and career growth. And if the amount of work shrinks, they would have to be retrained or let go.

    In many cases, especially when no special training is required, Toloka, Yandex's crowdsourcing platform, can take on such work. The system scales easily: if one customer has fewer tasks, the Tolokers move on to another; if the number of tasks grows, they are only happy about it.

    Below are examples of how Toloka helps Yandex and other companies develop their products. All headings are clickable: the links lead to recordings of the talks.

    Chat with chatbots and pick the best one: MIPT's experience

    MIPT used Toloka to assess chatbot quality as part of the DeepHack.Chat hackathon, which involved six teams. The task was to develop a chatbot that could talk about itself based on a profile with a brief description of its personal traits.

    Tolokers and bots each received a profile and had to impersonate the person described in it: talk about themselves and learn more about their interlocutor. Dialogue participants did not see each other's profiles.

    Only users who passed an English proficiency test were admitted to the task, since all the hackathon's chatbots spoke English. It was impossible to talk to a bot directly through Toloka, so the task provided a link to a Telegram channel where the chatbot was running.

    After talking with the bot, the user received a dialogue ID, which, together with their evaluation of the dialogue, was submitted in Toloka as the answer.

    To weed out unscrupulous Tolokers, it was necessary to check how well each user had talked with the bot. For this, a separate task was created in which performers read the dialogues and evaluated the behavior of the user, that is, of the Toloker from the previous task.
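    The two-pass check described above can be sketched like this (toy data and function names, not the real Toloka API): a dialogue from the first task is kept only if reviewers in the second task judged the Toloker's side of the conversation to be good-faith.

```python
# Hypothetical sketch of the dialogue-verification step: each dialogue gets
# several 0/1 votes from reviewers in the checking task, and only dialogues
# with enough positive votes count as valid answers of the first task.

def filter_valid_dialogues(reviews, min_score=0.5):
    """reviews maps dialogue_id -> list of 0/1 votes from the checking task.

    A dialogue counts as valid when the share of positive votes reaches
    min_score; otherwise the first Toloker's answer is rejected.
    """
    valid = []
    for dialogue_id, votes in reviews.items():
        if votes and sum(votes) / len(votes) >= min_score:
            valid.append(dialogue_id)
    return sorted(valid)

reviews = {
    "dlg-1": [1, 1, 0],   # two of three reviewers approved
    "dlg-2": [0, 0, 1],   # mostly rejected
    "dlg-3": [1, 1, 1],
}
print(filter_valid_dialogues(reviews))  # ['dlg-1', 'dlg-3']
```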

    During the hackathon, the teams uploaded their chatbots. Over the course of a day, Tolokers tested them, quality scores were computed, and the teams were told their results, after which the developers adjusted their systems' behavior.

    Over the four days of the hackathon, the systems improved significantly. On the first day the bots gave inappropriate and repetitive answers; by the fourth day the answers had become more adequate and more detailed. The bots learned not only to answer questions but also to ask their own.

    An example of a dialogue on the first day of the hackathon:

    On the fourth day:

    Statistics: the evaluation lasted 4 days; about 200 Tolokers took part and processed 1,800 dialogues. The first task cost 180 dollars, the second 15 dollars. The share of valid dialogues was higher than when working with volunteers.

    How to teach a self-driving car to recognize surrounding objects

    An important task for the creator of a self-driving car is to teach it to extract information about surrounding objects from its sensor data. During a trip, the car records everything it sees around it. This data is uploaded to the cloud, where primary analytics is done, and then goes to post-processing, which includes markup. The labeled data is fed to machine learning algorithms, the result is returned to the car, and the cycle repeats, improving the quality of object recognition.
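    The cycle above can be sketched schematically (all stage functions are hypothetical stand-ins, not Yandex's actual pipeline): raw frames go out for markup, and the labeled result extends the training set for the next model iteration.

```python
# Schematic sketch of one labeling iteration: frames collected by the car
# are sent to a markup step (in reality a Toloka task) and the labeled
# pairs are appended to the dataset used for training.

def run_labeling_cycle(raw_frames, label_fn, dataset):
    """One iteration: label new frames and extend the training set."""
    labeled = [(frame, label_fn(frame)) for frame in raw_frames]
    dataset.extend(labeled)
    return dataset

# Stand-in "markup" step: in reality this answer comes from performers.
def toy_label_fn(frame):
    return "pedestrian" if "person" in frame else "background"

dataset = []
run_labeling_cycle(["frame_with_person", "empty_road"], toy_label_fn, dataset)
print(len(dataset))  # 2
```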

    There are many different objects in a city, and all of them need to be marked up. This task requires certain skills and takes a lot of time, and tens of thousands of images are needed to train a neural network. They could be taken from open datasets, but those are collected abroad, so the images do not match Russian streets. Labeled images can be bought for around 4 dollars apiece, but doing the markup in Toloka turned out to be about 10 times cheaper.

    Since you can embed any interface in Toloka and transfer data via the API, the developers plugged in their own visual editor with layers, transparency, selection, magnification, and class separation. This increased the speed and quality of the markup severalfold.

    In addition, the API lets you automatically split tasks into simpler ones and assemble the result from the pieces. For example, before marking up a picture, you can first ask which objects it contains. This shows which classes the image should be annotated with.

    After that, the objects in the image can be classified. For example, Tolokers can be shown a selection of pictures containing people and asked to clarify whether they are pedestrians, cyclists, motorcyclists, or someone else.
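    The two-stage decomposition can be sketched as follows (helper and field names are invented for illustration): stage 1 asks which coarse classes are present in each image, and stage 2 tasks are generated only for images that actually contain the class of interest.

```python
# Sketch of splitting a markup job into two simpler tasks: coarse class
# detection first, then a refinement question only where it applies.

def make_stage2_tasks(stage1_answers, target_class, subtypes):
    """stage1_answers: image_id -> set of coarse classes seen by Tolokers.

    Returns follow-up tasks asking to refine target_class into subtypes.
    """
    return [
        {"image": image_id,
         "question": f"Which kind of {target_class}?",
         "options": subtypes}
        for image_id, classes in stage1_answers.items()
        if target_class in classes
    ]

stage1 = {
    "img-1": {"person", "car"},
    "img-2": {"traffic_light"},
    "img-3": {"person"},
}
tasks = make_stage2_tasks(stage1, "person",
                          ["pedestrian", "cyclist", "motorcyclist", "other"])
print([t["image"] for t in tasks])  # ['img-1', 'img-3']
```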

    When a Toloker has finished the markup, it needs to be checked. For this, verification tasks are created and offered to other performers.

    Not only Tolokers but also neural networks do markup. Some of them have already learned to handle this task as well as people do, but the quality of their work also needs to be assessed. Therefore, alongside the pictures marked up by Tolokers, the tasks also include pictures marked up by a neural network.
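    One simple way to assess a model's markup in this setup (a toy sketch, not the production metric) is to compare its labels against the majority answer of several Tolokers for the same images:

```python
# Toy sketch: agreement between a neural network's labels and the human
# majority vote, computed over images that were labeled by both.

from collections import Counter

def model_agreement(model_labels, human_labels):
    """Share of images where the model matches the human majority vote."""
    hits = 0
    for image_id, predicted in model_labels.items():
        majority, _ = Counter(human_labels[image_id]).most_common(1)[0]
        hits += predicted == majority
    return hits / len(model_labels)

humans = {"img-1": ["car", "car", "bus"], "img-2": ["tree", "pole", "pole"]}
model = {"img-1": "car", "img-2": "tree"}
print(model_agreement(model, humans))  # 0.5
```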

    In this way Toloka integrates directly into the training of neural networks and becomes part of the whole machine learning pipeline.

    Evaluating search quality at the Ozon online store

    Ozon uses Toloka to create a reference sample. It is needed for several purposes:

    • Evaluating the quality of the new search engine.
    • Determining the most effective ranking model.
    • Improving the search algorithm using machine learning.

    The first test sample was compiled manually: the team took 100 queries and labeled them themselves. Even such a small sample helped identify search problems and define evaluation criteria. The company considered building its own tool for assessing search quality and hiring and training assessors, but that would take too much time, so they decided to go with a ready-made crowdsourcing platform.

    Training turned out to be the hardest part of preparing the job for Tolokers: even the company's own employees could not pass the first test task. After gathering feedback from the team, they developed a new test: training went from simple to complex, and the tasks were built around the performer qualities that matter to the company.

    To eliminate errors, Ozon conducted a test run. The task consisted of three blocks: training, a control block with a threshold of 60% correct answers, and the main task with a threshold of 80% correct answers. To improve the quality of the sample, each task was offered to five performers.
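    The gating logic of the three blocks can be sketched like this (thresholds are from the text; the function and its return values are hypothetical): a performer must finish training, clear the 60% control block, and is then held to 80% correct answers on the main task.

```python
# Sketch of the three-block admission pipeline with the thresholds
# described above: training -> control (60%) -> main task (80%).

def performer_stage(training_done, control_accuracy, main_accuracy=None):
    """Classify where a performer stands in the pipeline."""
    if not training_done:
        return "training"
    if control_accuracy < 0.6:
        return "rejected at control"
    if main_accuracy is not None and main_accuracy < 0.8:
        return "removed from main task"
    return "admitted to main task"

print(performer_stage(True, 0.7))        # admitted to main task
print(performer_stage(True, 0.5))        # rejected at control
print(performer_stage(True, 0.9, 0.75))  # removed from main task
```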

    Test run statistics: 350 tasks in 40 minutes on a budget of 12 dollars. 147 performers entered the first stage, 77 completed the training, and 12 acquired the skill and carried out the main task.

    The scenario of the main launch was more complex: it included not only new Tolokers but also those who had acquired the necessary skill during the test stage. The former went through the standard chain; the latter were admitted to the main tasks right away. In the main launch, additional checks were added: the percentage of correct answers in the main sample and the majority opinion. Each task was still offered to five performers.
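    The majority-opinion check over five performers can be sketched as follows (toy code; quorum size is an assumption, not stated in the talk): an answer is accepted only when a strict majority of the five votes agrees.

```python
# Sketch of majority-vote aggregation for a task shown to five performers:
# the winning label needs at least `quorum` votes, otherwise no consensus.

from collections import Counter

def majority_answer(votes, overlap=5, quorum=3):
    """Return the winning label, or None when no label reaches the quorum."""
    assert len(votes) == overlap
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= quorum else None

print(majority_answer(["suitable"] * 3 + ["not suitable"] * 2))  # suitable
print(majority_answer(["suitable", "not suitable", "additional",
                       "suitable", "additional"]))               # None
```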

    Statistics of the main launch: 40,000 tasks in one month on a budget of 1,150 dollars. 1,117 Tolokers joined the project, 18 gained the skills, and 6 got access to the largest main pool and evaluated it.

    Now the Ozon task on Toloka looks like this:

    The performer sees a search query and 9 products from the search results. Their task is to choose one of the ratings: "suitable", "not suitable", "suitable as a replacement", "additional", or "does not open". The last rating helps detect technical problems on the site. To simulate user behavior as accurately as possible, the developers recreated the online store's interface via an iframe.
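    The "does not open" rating can surface technical problems, for example like this (a toy sketch with invented field names and threshold): products whose share of such votes is high get flagged for the site team.

```python
# Sketch: flag products where the share of "does not open" votes among
# performer ratings exceeds a threshold, hinting at a broken product page.

def broken_products(ratings, threshold=0.4):
    """ratings: product_id -> list of labels chosen by performers."""
    flagged = []
    for product_id, labels in ratings.items():
        share = labels.count("does not open") / len(labels)
        if share >= threshold:
            flagged.append(product_id)
    return flagged

ratings = {
    "sku-1": ["suitable", "does not open", "does not open",
              "suitable", "does not open"],
    "sku-2": ["suitable"] * 5,
}
print(broken_products(ratings))  # ['sku-1']
```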

    In parallel with launching the job in Toloka, search queries were labeled using rules. The emphasis was on popular queries, so as to improve the search results for them first of all.

    The markup rules made it possible to quickly obtain data for a small number of queries and showed good results on top queries. But there were downsides: ambiguous queries cannot be evaluated by rules, and many contentious cases arise. Moreover, in the long run this method turned out to be quite expensive.

    Human markup covers these drawbacks. In Toloka you can collect the opinions of a large number of performers, and the assessment is more fine-grained, which allows deeper work with the search results. After the initial setup, the platform is stable and handles large volumes of data.

    Manual labor and artificial intelligence do not oppose each other. The more artificial intelligence develops, the more manual labor is required to train it. On the other hand, the better neural networks are trained, the more routine tasks can be automated, freeing people from them.

    Virtually any task, however large, can be divided into many smaller ones and built on top of crowdsourcing. Most of the tasks solved in Toloka are the first step toward training models and automating processes on the data collected by people.

    In the next publication on this topic, we will talk about how crowdsourcing is used to teach Alice, moderate comments, and monitor compliance with the rules in Yandex.Buses.
