Akinator and Math
Akinator has come up on Habré several times, including under the "I don't know how it works" tag. I came across it recently and, of course, was delighted. Then, like probably many others, I wondered: "But how does it work?" I couldn't find an answer to this question anywhere, so I set out to write a program with similar functionality, figuring out what was going on along the way.
Functional requirements
The first step is to figure out what "a program with similar functionality" means. After a little thought, I settled on the following requirements:
- The program must be able to learn. Obviously, you cannot teach a program to recognize several hundred thousand characters by manually entering the answers to every question for each of them. In theory it is possible, but we will look for a more elegant solution. The obvious alternative is to learn on the go from user responses, and that is what our program must be able to do.
- The program must forgive mistakes. Users' opinions on some questions can obviously vary widely. A simple example: a die-hard homophobe and an ordinary man both think of Brad Pitt. Asked "Is your character sexy?", the first will probably answer no, while most people would answer otherwise. This slight discrepancy should not prevent our program from finding the truth.
- The program must choose its questions wisely. There are many strategies for picking questions. You could simply ask every question there is (but there are far too many of them). You could ask random questions, but then the game would most likely drag on for a very long time. Or you can try to choose each next question so as to extract as much information as possible from its answer. That is what we will try to do.
Algorithms
If it weren't for the error-forgiveness requirement, achieving what we want would be quite simple. For example, we could store a tree of answers to questions, in which internal vertices correspond to questions and leaves correspond to guesses. A game would then look like a descent from the root to one of the leaves. However, this algorithm cannot cope with forgiving mistakes, and tree-balancing issues arise on top of that.
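For concreteness, here is a minimal sketch of that tree-based approach (the Node structure and the ask callback are made up for illustration, not a description of any real implementation):

```python
class Node:
    """A vertex of the guessing tree: either a question or a final guess."""
    def __init__(self, question=None, guess=None, children=None):
        self.question = question        # set on internal vertices
        self.guess = guess              # set on leaves
        self.children = children or {}  # maps a reply ("yes"/"no") to a subtree

def play(root, ask):
    """Descend from the root to a leaf; `ask` turns a question into a reply."""
    node = root
    while node.guess is None:
        node = node.children[ask(node.question)]  # one wrong reply derails everything
    return node.guess
```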
In a sense, a tree is a very "mechanistic", "machine-like" way of playing, extremely sensitive to the slightest inaccuracies. We want to play the way a rational person would. Those who are more or less familiar with probability theory will know its so-called Bayesian interpretation, as well as the Bayesian approach built on it. The core of this approach is describing knowledge with distributions of random variables and then turning a priori knowledge into a posteriori knowledge from observations, using the famous Bayes formula. Moreover, this approach is the only generalization of the classical algebra of logic to the case of uncertainty (one can read about this, for example, here), which leads many scientists to regard the Bayesian approach as the standard of rational thinking. That is exactly what we need. Let's try to apply it to our task.
Bayesian Model
So, recall the Bayes formula: P(A | B) = P(B | A) · P(A) / P(B). Now in words. Suppose we need to estimate the probability that event A happened, given that event B definitely happened (that is, we were guaranteed to observe it; this is why B is often called an observation). By the Bayes formula, this probability is proportional to the product of two others. The first, P(B | A), is called the likelihood and shows the probability with which event B occurs if A happened. The second factor, P(A), is the so-called a priori (prior) probability of event A, that is, the probability that it happens at all (regardless of B). In effect, this probability reflects what we knew about A before we learned that B happened. The denominator also contains the value P(B), which here merely plays the role of a normalization coefficient and can be ignored.
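To make the formula concrete, here is a tiny numeric illustration (all numbers and names are invented for the example):

```python
# Posterior P(A|B) is proportional to likelihood P(B|A) times prior P(A);
# P(B) only normalizes the result so the probabilities sum to one.
prior = {"Spock": 0.3, "Obama": 0.7}        # P(A), made-up priors
likelihood = {"Spock": 0.9, "Obama": 0.1}   # P(B|A) for one observed answer B

unnormalized = {a: likelihood[a] * prior[a] for a in prior}
p_b = sum(unnormalized.values())            # P(B), the normalization constant
posterior = {a: p / p_b for a, p in unnormalized.items()}
print(posterior)  # {'Spock': 0.794..., 'Obama': 0.205...}
```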
Using this formula in our game of questions is fairly easy. Let Ai be the event "the player has object i in mind", where i can be anyone from Spock to the Virgin Mary. Since B is an observation about Ai, it is natural to let B consist of the answers to the questions. The only sensible option I see is to represent B as the joint event "answer R1 was given to question Q1, ..., answer Rk was given to question Qk". Then P(Ai | B) shows, for object i, the probability that it is the hidden one, given the user's answers to the k questions asked so far. This is exactly the value we are after. Having chosen the object with the maximum P(Ai | B), we can, if that value is large enough, try it as a guess.
The a priori probability P(Ai) can be seen as the special case of P(Ai | B) with k = 0. In other words, it is the probability that the player has object i in mind given that no questions have been asked and we know nothing yet. On the one hand, we could give all objects equal P(Ai), because that would be fair. On the other hand, Barack Obama will probably be picked much more often than Holden Caulfield, so, ceteris paribus (that is, when we cannot otherwise tell the objects apart), Obama should be preferred. A natural estimate of P(Ai) is therefore the ratio of the number of games in which object i was the hidden one to the total number of games.
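As a sketch, this estimate is just a ratio of two counters (play_count and total_games are hypothetical names; the snippets below reuse them):

```python
def prior(play_count, i, total_games):
    """P(Ai): the fraction of games in which character i was the hidden one."""
    return play_count[i] / total_games
```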
The likelihood P(B | Ai) also gets a convenient interpretation. First, though, we need one small trick: assume the answers to the questions are conditionally independent given Ai (a somewhat crude, but very convenient simplification). In plain terms this means that, by assumption, P(B | Ai) can be written as the product over j of the probabilities P(Bj | Ai), where Bj is the event "answer Rj was given to question Qj". P(Bj | Ai) is then the ratio of the number of times that, with hidden object i, question Qj was answered with Rj to the number of times that, with hidden object i, question Qj was asked at all. To avoid zero or undefined probabilities, I propose to additionally pretend that each answer option to each question was initially chosen once. That is, for a question with m answer options, every answer count starts at one and the total count at m (Laplace smoothing).
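Here is a minimal sketch of this computation; the nested-dictionary layout counts[i][q][r] and the fixed answer set OPTIONS are my assumptions for illustration, not a description of any real storage format:

```python
OPTIONS = ("yes", "no", "don't know")  # assumed fixed set of answer options

# Hypothetical layout: counts[i][q][r] = how many times answer r was given
# to question q in games where character i turned out to be the hidden one.
def answer_likelihood(counts, i, q, r):
    """Smoothed estimate of P(Bj | Ai)."""
    times_r = counts[i][q].get(r, 0)
    times_asked = sum(counts[i][q].values())
    # Laplace smoothing: pretend each option was already chosen once,
    # so the estimate is never zero or 0/0.
    return (times_r + 1) / (times_asked + len(OPTIONS))

def likelihood(counts, i, answers):
    """P(B | Ai) as a product over the (question, answer) pairs seen so far."""
    p = 1.0
    for q, r in answers:
        p *= answer_likelihood(counts, i, q, r)
    return p
```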
Let's take stock of the intermediate result. We have found a simple formula that maps a set of question/answer pairs and an entity to the probability that, given these answers, this entity is the hidden one. Recomputing this probability for every object in our database after each new answer, we can see which of them currently look most like the hidden object. Moreover, training the model is simple: for each entity we just need to store which questions were asked about it and how many answers of each type users gave; after each game this information is updated from the user's responses. Also, to account for an entity's "popularity", we store the number of times it was the hidden one.
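Continuing the same sketch (it reuses the hypothetical likelihood and prior helpers from the snippets above):

```python
def posterior(characters, counts, play_count, total_games, answers):
    """P(Ai | B) for every character i, normalized over the whole database."""
    scores = {i: likelihood(counts, i, answers) * prior(play_count, i, total_games)
              for i in characters}
    norm = sum(scores.values())
    return {i: s / norm for i, s in scores.items()}

def update_after_game(counts, play_count, hidden, answers):
    """Learn from one finished game in which the player had `hidden` in mind."""
    play_count[hidden] += 1                          # the "popularity" counter
    for q, r in answers:
        counts[hidden][q][r] = counts[hidden][q].get(r, 0) + 1
```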
Choice of questions, information and entropy
Well, all that remains is to figure out which questions are best to ask. Naturally, we should ask the questions that give the most information. But can this information be measured? It turns out it can, using the concept of information entropy. Roughly but intelligibly speaking, information entropy is a characteristic of a random variable's distribution (measured, like information, in bits) that shows how uncertain we are about which value the variable will take. For example, if a random variable takes the value 1 with probability 0.99 and the value 0 with probability 0.01, the entropy of such a distribution is very close to zero. If the variable takes the values 0 and 1 with equal probabilities of 0.5 (heads or tails), its entropy equals 1 bit (exactly the amount of information we must obtain to eliminate the uncertainty).
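Entropy takes one line to compute; the two prints below reproduce the coin examples above:

```python
import math

def entropy(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(entropy({0: 0.5, 1: 0.5}))    # 1.0 bit: a fair coin
print(entropy({0: 0.01, 1: 0.99}))  # ~0.08 bits: almost no uncertainty left
```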
Okay, so each time let's choose the question whose answer will reduce the entropy of the distribution P(Ai | B) the most, since this distribution is precisely what encodes our knowledge about who the player has in mind. Here another problem immediately arises: generally speaking, different answers to the same question reduce the entropy by different amounts. What to do? I propose to find the question for which the expected decrease in entropy is maximal. The expected decrease shows how much the entropy will drop "on average" if we ask a given question. To avoid writing several more paragraphs of text, I will just give the formula for this value; those who wish can easily work out why it looks the way it does. So, each time we need to ask the question Qj for which the quantity

H[P(Ai | B, E1)] · P(E1) + ... + H[P(Ai | B, Em)] · P(Em)

is minimal. Here Ev denotes the event "answer v is given to question Qj", m is the number of answer options, and H[P] denotes the entropy of the probability distribution P. The value of P(Ev) is easily found by the formula of total probability, summing over all known objects: P(Ev) = sum over i of P(Ev | Ai) · P(Ai | B).
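Putting the pieces together, a sketch of the selection rule might look like this (it reuses the hypothetical helpers from the previous snippets; a real implementation would cache intermediate posteriors instead of recomputing them):

```python
def best_question(questions, characters, counts, play_count, total_games, answers):
    """Pick the question with the smallest expected posterior entropy."""
    current = posterior(characters, counts, play_count, total_games, answers)
    best, best_score = None, float("inf")
    for q in questions:
        expected = 0.0
        for r in OPTIONS:
            # P(Ev): probability of seeing answer r, by total probability
            p_r = sum(answer_likelihood(counts, i, q, r) * current[i]
                      for i in characters)
            new_post = posterior(characters, counts, play_count, total_games,
                                 answers + [(q, r)])
            expected += p_r * entropy(new_post)
        if expected < best_score:
            best, best_score = q, expected
    return best
```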
It turns out that this approach very quickly discards irrelevant questions, focusing on the most important ones. In a sense, the method is a probabilistic generalization of the "halving" (bisection) method. You can see how it all works together in the video below.
Summary
I hope some of you found this short article interesting. Above all, it should draw your attention to the power of even the simplest mathematics in solving poorly formalized problems. Using the Bayesian approach to probability theory and information theory in your programs can make them more rational, yet at the same time more humane :)