In search of the optimal point for applying human resources
One of the paradoxes of modern Internet platforms is that, although they are heavily automated and the content that end users see is shown without any human moderation, they are nevertheless completely dependent on human behavior. All they really do is observe, collect information, and draw conclusions from the actions of hundreds of millions or billions of people.
This principle originated with PageRank. Instead of relying on manually written rules to capture the meaning of each individual page, or working with the original text, PageRank looks at what people have done or said about that page. Who is connected to it in any way, what text did they use, and who is connected to the people associated with the page? At the same time, Google gives every user the chance to rank (index, rate) each set of search results by hand: you are given 10 blue links, and you simply tell Google which one fits. The same is true of Facebook: Facebook doesn’t really know who you are, what you are interested in, or what any given piece of content is about. But it knows who you follow, what you like, who else likes the same things, and what else those people like and subscribe to. Facebook is a human-oriented PageRank. Broadly the same applies to YouTube: it never knew what a particular video was about, only what people wrote under it and what else they watched and liked.
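To make the idea of ranking pages by links rather than by their content concrete, here is a minimal sketch of the simplified textbook form of PageRank, computed by power iteration. It is an illustration only, not a description of Google's production system. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the three-page graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links: dict mapping a page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: "a" and "b" both link to "c", "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)  # the page people point to most
```

Nothing in the sketch looks at what a page says. The ranking comes entirely from the links that people chose to create, which is the point made above.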
At their core, these systems are huge “mechanical Turks.” They have no understanding at all of the content they work with; they only try to create, capture, and convey human sentiment about that content. They are huge distributed computing systems in which people act as the processors and the platform itself is a combination of routers and interconnects. (This reminds me a little of the idea in “The Hitchhiker's Guide to the Galaxy” that the whole Earth is actually a huge computer performing a certain function, and our daily activities are part of the calculation.)
This means that much of the design of such a system comes down to finding the optimal points at which to apply human effort within an automated system. Do you capture what is already happening? That is how Google started, using links that already existed. Do you need to stimulate activity in order to extract its value? Facebook had to create the activity itself before it could get any benefit from it. Maybe you rely heavily on human labor? That is the approach of Apple Music, with its hand-picked playlists that are served automatically to tens of millions of users. Or do you have to pay people to do everything?
Yahoo’s directory of Internet resources was originally an attempt at the “pay people to do everything” approach: Yahoo paid people to catalog the entire Internet. At first this seemed achievable, but the Internet grew too fast and the task soon proved overwhelming; by the time Yahoo gave up, its catalog already exceeded 3 million pages. PageRank solved this problem. By contrast, Google Maps uses a large number of camera cars that are driven by people (for now) along almost every street in the world, and many more people look through the photos, and this is not an overwhelming task; it just costs a lot. Google Maps is a kind of private “mechanical Turk.” We are now facing exactly the same question with human content moderation: how many tens of thousands of people do you need to review every post, and how much of the task can be automated? Is the task overwhelming, or is it just very expensive?
If you look at these platforms as using billions of people to do the real computing, two interesting questions follow: what vulnerabilities exist in such platforms, and how can machine learning change this area?
In the past, when we thought about hacking computer systems, we had in mind various technical vulnerabilities: stolen or weak passwords, unpatched holes in systems, bugs, buffer overflows, SQL injections. We pictured “hackers” looking for holes in software. But if you think of YouTube or Facebook as distributed computer systems in which conventional software acts as the routers and people play the role of the processors, then any attacker will immediately think about finding vulnerabilities not only in the software but also in the people. Typical cognitive biases begin to play the same role as typical software defects.
In other words, there are two ways to rob a bank: you can get past the alarm system and pick the lock on the safe, or you can bribe a bank employee. In both cases the system has failed, but now one of the systems is you and me. Therefore, as I wrote in this article about Facebook’s recent change of course towards privacy and user safety, content moderation by real people on such platforms is inherently similar to antivirus software, which began to develop rapidly in response to the appearance of malware on Windows two decades ago. One part of the computer watches whether another part is doing something it should not do.
Even leaving deliberate attacks aside, other problems arise when you try to analyze the activity of one person with the help of another person. As when you use one computer to analyze another, you run the risk of creating feedback loops. This is reflected in concepts such as the “filter bubble,” “YouTube radicalization,” or search spam. At the same time, one of the problems Facebook has run into is that sometimes the availability and production of a large amount of data erodes the value of that data. Call this the news feed overload problem: say you have 50 or 150 friends and you publish 5 or 10 posts a day, or something like that, but all your friends do exactly the same, and now you have 1,500 posts in your feed every day.
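The feed arithmetic is easy to check. The sketch below multiplies friends by daily posts for the figures mentioned in the text, then estimates what share of the feed a reader could actually get through. The function names and the attention budget of 100 posts a day are hypothetical, introduced only for illustration.

```python
def daily_feed_size(friends, posts_per_friend_per_day):
    """Number of new posts that land in one user's feed per day."""
    return friends * posts_per_friend_per_day


def share_seen(feed_size, attention_budget):
    """Fraction of the feed a user reads, given how many posts
    they are willing to look at per day (a hypothetical budget)."""
    if feed_size == 0:
        return 1.0
    return min(1.0, attention_budget / feed_size)


low = daily_feed_size(50, 5)      # 250 posts a day
high = daily_feed_size(150, 10)   # 1500 posts a day

# Assumption: the user reads at most 100 posts a day.
seen_at_high_volume = share_seen(high, 100)
```

With 150 friends posting 10 times a day, a reader with that assumed budget sees only about 7% of the feed, so the platform has to choose which posts those are.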
“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” - Charles Goodhart
So how can machine learning make a difference? Earlier I said that the main difficulty is how to apply human effort to software in the most effective way, but there is another option: just let the computer do all the work. Until very recently, the reason such systems existed at all was a large class of tasks that computers could not solve, although people solved them instantly. We called them “tasks that are easy for a person but hard for a computer,” but in reality they were tasks that are easy for a person but that a person is practically unable to describe to a computer. The breakthrough of machine learning is that it lets computers work out the necessary description themselves.
The comic below (straight from 2014, just when machine learning and computer vision systems began to develop rapidly) perfectly illustrates these changes. The first task was easily accomplished, unlike the second, at least until the advent of machine learning.
The old way to solve this problem was to find people to classify the image, that is, to resort to some kind of crowdsourcing. In other words, to use a “mechanical Turk.” But today we may no longer need anyone to look at the image, because with machine learning we can very often automate the solution to this particular problem.
So: how many problems that you could previously solve only by analyzing the actions of millions or hundreds of millions of people can you now solve with machine learning, without needing to involve users at all?
Of course, there is something of a contradiction here, because machine learning always needs a large amount of data. Someone could obviously say that if you have a large platform, you automatically have a lot of data, so machine learning will come more easily to you. That is certainly true, at least at the beginning, but I think it is worth asking how many tasks could be solved only with the help of your existing users. In the past, if you had a photo of a cat, it could be labeled “cat” only if you had enough users that one of them would look at that particular photo and tag it. Today you don’t need any real users to process this particular cat image; you just need some other users, anywhere in the world, at some point in the past, to have classified enough other cat images to produce the necessary recognition model.
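The shift described here, from needing a person to tag each photo to reusing labels that other people produced earlier, can be sketched with a deliberately tiny classifier. This is a toy nearest-centroid model over made-up two-number feature vectors. Real image recognition uses far richer models, and the feature values, labels, and function names below are all hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(labeled_examples):
    """Build one average feature vector (centroid) per label
    from examples that people labeled at some point in the past."""
    sums = {}
    counts = defaultdict(int)
    for features, label in labeled_examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(centroids, features):
    """Label a new item with the nearest centroid; no human looks at it."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical labels gathered earlier from other users, elsewhere.
past_labels = [
    ([0.9, 0.8], "cat"),
    ([0.8, 0.9], "cat"),
    ([0.1, 0.2], "dog"),
    ([0.2, 0.1], "dog"),
]
model = train_centroids(past_labels)

# A new photo that nobody has tagged.
new_photo_label = classify(model, [0.85, 0.7])
```

The human work has not disappeared: it is stored in `past_labels`. What changes is that the people who did it never need to see the new photo.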
This is just another way of making the best use of human effort: in any case, you need people to classify things (and to write the rules by which people classify them). But here we are moving the lever and possibly changing, radically, the number of people needed, and so the rules of the game are changing, to some extent, because of the “winner takes all” effect. In the end, all these large-scale social platforms are just huge collections of manually classified data, so is their glass half full or half empty? On the one hand, it is half full: they have the largest collection of manually classified data (in their particular field). On the other hand, it is half empty: this data was selected and classified manually.
Even where the data could form one of these platforms (which most likely will not happen, in fact certainly will not happen, as I wrote here), they would still become, well, a platform. As with AWS, which gave startups economies of scale for their infrastructure without their needing millions of users, such tools would mean that you no longer need millions or billions of users to recognize a cat. You can automate the process.
Translation: Alexander Tregubov
Editing: Alexey Ivanov