# Yet Another Rating System

So, the topic of rating systems continues to excite the minds of Habr users. New schemes, formulas, and tests keep appearing, and every time it comes down to the same question: how to combine the average user rating with our confidence in that rating. For example, if one film received 80 positive and 20 negative votes, and another received 9 positive and 1 negative, which film is better? Without claiming to create a new universal rating system, I will nevertheless propose one possible approach to this particular problem.

#### Normal distribution approximation

The very wording of the problem (estimate some value together with our confidence in it) suggests using a probability distribution as the model, for example the normal distribution.

**What is a normal distribution?!**

For those who skipped their statistics lectures, let me recall what a normal distribution is, and what a probability distribution is in general. Suppose we arrive at a bus stop just as a bus pulls away in front of us. We know that the next one will come in about 15 minutes. Well, maybe in 16. Or, on the contrary, in 14. In principle, the driver could hurry and arrive as early as the 12th minute, but that is much less likely. The graph below shows the probability distribution of the bus arrival time by minute: most likely it will arrive at the 15th minute, slightly less likely at the 14th or 16th, and with very small probability at the 12th or 18th.

It should be understood that the value on the Y axis is not a probability but a *probability density* (probability density function, PDF). The probability itself is calculated as the area under the graph between two values X1 and X2; for example, the probability that the bus arrives between the 15th and 16th minute is 0.248 in this case. But more on that later.

The normal distribution is characterized by two parameters: the mean (here 15 minutes) and the variance (the spread), which shows how uncertain the mean is: the greater the variance, the wider the graph, and the less sure we are of when this bus will finally come.
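The 0.248 figure can be reproduced as a difference of two CDF values. This is only a sketch: the text gives the mean of 15 minutes, while the standard deviation of 1.5 minutes is an assumption, picked here because it matches the quoted number.

```python
import scipy.stats as ss

# Assumed model: arrival time ~ Normal(mean=15, std=1.5).
# The standard deviation is a guess that fits the quoted 0.248, not a value from the text.
bus = ss.norm(loc=15, scale=1.5)

# Probability = area under the density between two points = difference of CDF values.
p_15_16 = bus.cdf(16) - bus.cdf(15)
print(round(p_15_16, 3))
```

The same subtraction works for any interval, which is the sense in which a probability is an area under the density graph.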

A rating is, as a rule, just a number, the final result of some evaluation. What we will actually estimate is the presumed quality of the film (coffee grinder, article, user: underline as appropriate). The graph below shows the distributions for two hypothetical films.

The first film (blue line) drew mixed reviews (the mean of the distribution is 0.5). The second film (green line), by contrast, received more positive than negative ratings, but fewer people voted, so we are much less confident in the result (the variance is much larger than for the first film).

In principle, the normal distribution by itself already models a rating quite well (the central limit theorem provides a theoretical justification for this). However, statistics has a more convenient tool for the job.

#### Beta distribution

Like the normal distribution, the beta distribution is defined by two parameters, alpha > 0 and beta > 0 (written X ~ B(alpha, beta)). However, unlike the normal distribution, which is always bell-shaped, the beta distribution is far more flexible. In particular, for alpha = 1 and beta = 1 it turns into the uniform distribution (dark blue line in the figure below), for alpha < 1 and beta < 1 the density takes the shape of a well (green line), and for alpha > 1 and beta > 1 it resembles the normal distribution (red and light blue lines).

**programming exercise**

It would be unfair to keep showing graphs without explaining how to draw them and play with the parameters, so here and below I will show the code that generates each image. The examples are in Python and use the NumPy, SciPy and matplotlib libraries (all three are available from pip), but they can easily be ported to R, Matlab/Octave, Java and even JavaScript.

For all examples, the following imports will be needed:

```
from numpy import *
import scipy.stats as ss
import pylab as plt
```

The previous chart was generated with the following code:

```
x = arange(101) / 100.
plt.plot(x, ss.beta(1, 1).pdf(x))
plt.plot(x, ss.beta(.7, .7).pdf(x))
plt.plot(x, ss.beta(5, 5).pdf(x))
plt.plot(x, ss.beta(10, 5).pdf(x))
plt.show()
```

In addition, the beta distribution has several interesting properties:

- It is limited to a finite interval. If we want to "lock" the possible values into the range from 0 to 1, the beta distribution is exactly what we need.
- It is symmetric with respect to its parameters. The graph of B(alpha, beta) is the mirror image of the graph of B(beta, alpha).
- alpha and beta act on opposite sides of the density graph. As alpha grows, the graph shifts and tilts to the right; as beta grows, it shifts to the left.
- The variance shrinks as the parameters grow together: for a fixed ratio alpha / (alpha + beta) it is inversely proportional to alpha + beta + 1. A single increment that pulls the mean toward 0.5 can still widen the graph temporarily.
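The properties listed above can be checked numerically. The sketch below uses only the frozen-distribution methods `pdf`, `mean` and `var` from `scipy.stats`; the particular parameter values are arbitrary illustrations.

```python
import numpy as np
import scipy.stats as ss

x = np.linspace(0.01, 0.99, 99)

# Mirror symmetry: the density of B(a, b) at x equals the density of B(b, a) at 1 - x.
mirror_ok = np.allclose(ss.beta(10, 5).pdf(x), ss.beta(5, 10).pdf(1 - x))

# Raising alpha moves the mean to the right, raising beta moves it to the left.
mean_base = ss.beta(5, 5).mean()
mean_more_alpha = ss.beta(10, 5).mean()
mean_more_beta = ss.beta(5, 10).mean()

# Scaling both parameters together keeps the mean and shrinks the variance.
var_small = ss.beta(7, 3).var()
var_large = ss.beta(70, 30).var()

print(mirror_ok, mean_more_beta < mean_base < mean_more_alpha, var_large < var_small)
```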

#### User ratings

But what if we take the number of positive and the number of negative user ratings as the parameters alpha and beta, respectively? The distribution can then be initialized with both parameters equal to one (which, generally speaking, corresponds to Laplace smoothing). Initially our estimate of the film's quality is uniformly distributed (we know nothing about it), and each vote increases one of the parameters and shifts the graph either to the right (alpha, positive votes) or to the left (beta, negative votes); as votes accumulate, the variance shrinks. Moreover, our estimate of the film's quality never leaves the interval [0..1] and, in effect, shows the probability that the film will be liked by the average viewer.
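The bookkeeping described above fits in a few lines. This is a minimal sketch; the function names `beta_params` and `mean_rating` are illustrative, not taken from any library.

```python
def beta_params(positive, negative, prior_alpha=1.0, prior_beta=1.0):
    """Turn vote counts into beta distribution parameters.

    The default prior of (1, 1) is the uniform distribution,
    i.e. the Laplace smoothing mentioned in the text.
    """
    return prior_alpha + positive, prior_beta + negative


def mean_rating(alpha, beta):
    """Mean of B(alpha, beta): the expected share of viewers who like the film."""
    return alpha / (alpha + beta)


alpha, beta = beta_params(positive=6, negative=2)
print(alpha, beta, mean_rating(alpha, beta))
```

With no votes at all the mean is 0.5, which is what "we know nothing about the film" amounts to under a uniform prior.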

Let's look at a few examples. When a new film appears and nobody has expressed an opinion about it yet, its alpha and beta parameters are both equal to one, and the density graph coincides with that of the uniform distribution:

**programming exercise**

```
# continuing the previous example
plt.plot(x, ss.beta(1, 1).pdf(x))
plt.show()
```

It turns out that the information about the film was uploaded by the director himself. He uploaded it and voted for it himself. Positively, naturally. He also asked five of his assistants to help. The result: alpha = 1 + 1 + 5 = 7, beta = 1:

**programming exercise**

```
# same as before
plt.plot(x, ss.beta(7, 1).pdf(x))
plt.show()
```

The director's ex-wife saw the film's page and decided to spoil the rating by voting negatively together with her lover. The result: alpha = 7, beta = 1 + 2 = 3:

**programming exercise**

```
plt.plot(x, ss.beta(7, 3).pdf(x))
plt.show()
```

After 8 votes, the average score with Laplace smoothing is alpha / (alpha + beta) = 7/10 = 0.7. However, the graph shows that the variance of the resulting distribution is still high, which means that our confidence in this estimate is low.

Suppose that during the first week of release 90 more people voted for the film, so that alpha ended up at 70 and beta at 30. The average rating is still 70/100 = 0.7, but the graph changes significantly:

**programming exercise**

```
plt.plot(x, ss.beta(70, 30).pdf(x))
plt.show()
```

The variance in the second graph is much smaller. That is, as the number of votes grows, so does our confidence in the estimate of the film's quality.
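To put numbers on "much smaller", the mean and standard deviation of a beta distribution can be computed directly from the standard closed-form expressions, without any libraries. The helper names below are illustrative.

```python
from math import sqrt


def beta_mean(alpha, beta):
    """Mean of B(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_std(alpha, beta):
    """Standard deviation of B(alpha, beta), from the closed-form variance."""
    total = alpha + beta
    variance = alpha * beta / (total ** 2 * (total + 1))
    return sqrt(variance)


for a, b in [(7, 3), (70, 30)]:
    print(a, b, round(beta_mean(a, b), 3), round(beta_std(a, b), 3))
```

Both distributions have a mean of 0.7, but the standard deviation drops from roughly 0.138 to roughly 0.046 once the vote count grows tenfold.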

#### Rating

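All this is well and good, but the user does not want to look at strange graphs. The user needs a rating: a single number that tells them whether to watch the film ~~or go read a book instead~~. In principle, given the parameters of the beta distribution, you can compute the mean and the variance and try to combine them somehow (for example, divide the mean by the logarithm of the variance). But there is a more statistically sound way.

To make the discussion more concrete, let's take two films: the one from the previous section, with the distribution B(70, 30), and another, more popular one, with the distribution B(650, 350). Their distributions are shown in the figure below: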

**programming exercise**

```
plt.plot(x, ss.beta(70, 30).pdf(x))
plt.plot(x, ss.beta(650, 350).pdf(x))
plt.show()
```

On the one hand, the average rating of the first film is higher: 0.7 versus 0.65. On the other hand, many more people watched the second film, so we do not know what the rating of the first film would be after the same number of reviews. So how do we compare them?

One way to compare them is to calculate the *minimum confidence quality* of the film: a number showing the lowest rating the film could end up with after an infinite number of reviews. In statistics it is not customary to push everything to the absolute, so as the confidence level we will take not 100% but the standard 95%. This means we want to be 95% sure that the film is *no worse* than X. Graphically, this means that 95% of the area under the graph must lie to the right of X:

**programming exercise**

Almost all statistical libraries provide, for every distribution they implement, a cumulative distribution function (CDF), which takes a value and returns the probability that the random variable will be *less than this value*. In other words, the CDF at some X returns the area under the graph between zero and X. This differs from what we need in two respects.

First, we need the area on the other side, from X to 1. Fortunately, as mentioned above, the beta distribution is symmetric with respect to its parameters, so instead of the direct distribution B(alpha, beta) we can work with the mirrored one, B(beta, alpha).

Second, we need a function that, for a given confidence level (a fraction of the total area under the graph), returns the corresponding value of X. In most math packages this function is called the inverse CDF or something similar, but SciPy uses the name PPF (percent point function, also found in the literature as the quantile function).

Putting it together, the minimum confidence quality of a film can be computed with the following code:

```
dist1 = ss.beta(70, 30) # distribution for the first film, for reference only
dist2 = ss.beta(650, 350) # distribution for the second film, also for reference only
idist1 = ss.beta(30, 70) # mirror image of the first distribution
idist2 = ss.beta(350, 650) # mirror image of the second distribution
q1 = 1 - idist1.ppf(.95) # minimum quality of the first film at the 95% confidence level
q2 = 1 - idist2.ppf(.95) # minimum quality of the second film at the 95% confidence level
...
>>> q1
0.62272854953840073
>>> q2
0.62503161244929017
```

The calculation shows that, with 95% probability, the first film will eventually be liked by at least a 0.6227 share of all viewers, and the second by at least 0.6250. The difference is only about two thousandths, but if you rank by these values, the second film ends up higher in the list despite its lower average rating.
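As a sanity check, the same number can be obtained without mirroring the distribution: the value with 95% of the area to its right is simply the 5% quantile of the original distribution. The sketch below compares both routes and confirms the area with the survival function `sf`, which SciPy's frozen distributions provide alongside `ppf`.

```python
import scipy.stats as ss

dist = ss.beta(70, 30)       # first film
mirrored = ss.beta(30, 70)   # the same distribution reflected around 0.5

q_mirrored = 1 - mirrored.ppf(0.95)  # the route via the mirrored distribution
q_direct = dist.ppf(0.05)            # 5% quantile of the original distribution
area_right = dist.sf(q_direct)       # survival function: area from q_direct to 1

print(q_mirrored, q_direct, area_right)
```

Both routes agree up to floating-point error, so which one to use is a matter of taste.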

The same calculation can be repeated for the films mentioned at the very beginning of the post: for the film with the 80/20 ratio the minimum confidence quality is 0.731, and for the film with the 9/1 ratio it is 0.717, i.e. the number of votes again outweighs the average score. However, add just one more positive vote to the second film and its coefficient becomes 0.741, putting it in first place.
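The comparison above can be wrapped in a small helper. The quoted figures appear to correspond to using the raw vote counts directly as the parameters, i.e. B(80, 20), B(9, 1) and B(10, 1), without adding one to each; that reading is an inference from the numbers, not something the text states. The function name `min_confidence_quality` is illustrative.

```python
import scipy.stats as ss


def min_confidence_quality(alpha, beta, confidence=0.95):
    """Value X such that `confidence` of the B(alpha, beta) mass lies to the right of X."""
    return ss.beta(alpha, beta).ppf(1 - confidence)


# Assumption: raw counts are used as parameters, with no +1 smoothing.
films = {"80/20": (80, 20), "9/1": (9, 1), "10/1": (10, 1)}

ranking = sorted(films, key=lambda name: min_confidence_quality(*films[name]), reverse=True)
for name in ranking:
    print(name, round(min_confidence_quality(*films[name]), 3))
```

For B(9, 1) the CDF is simply x^9, so the result can be verified by hand as 0.05^(1/9), which is about 0.717.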

#### Variations, advantages and disadvantages

All the coefficients given here were, by and large, picked by eye. Although they seem to give fairly sane results, in a real application it makes sense to experiment with different values. For example, when a large number of users vote for films, it may make sense to increase the parameters not by 1 but by, say, 0.5 per vote. You could even introduce a decay factor so that each subsequent vote carries less weight than the previous one, which slows down the growth of the coefficients.
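One possible way to implement both ideas, a smaller per-vote increment and a decaying weight, is sketched below. The function `weighted_params` and its `weight` and `decay` parameters are hypothetical names, and geometric decay is just one of many schedules that could be tried.

```python
def weighted_params(votes, weight=1.0, decay=1.0, prior=(1.0, 1.0)):
    """Accumulate beta parameters from a sequence of votes.

    votes  -- iterable of booleans, True for a positive vote
    weight -- increment contributed by the first vote
    decay  -- factor applied to the increment after every vote (1.0 = no decay)
    """
    alpha, beta = prior
    current = weight
    for liked in votes:
        if liked:
            alpha += current
        else:
            beta += current
        current *= decay
    return alpha, beta


votes = [True] * 6 + [False] * 2
print(weighted_params(votes))                 # plain counting
print(weighted_params(votes, weight=0.5))     # each vote counts for half
print(weighted_params(votes, decay=0.9))      # later votes weigh less
```

Note that with decay the order of the votes starts to matter, which may or may not be desirable.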

In addition, the initial estimate of the film can be improved. In this article I assumed that at the start we know nothing about the film itself or about the other films in our system, so a new film is assigned the uniform distribution (alpha = 1, beta = 1). In practice, however, we usually know something about the film in advance and can use that information as a prior estimate. For example, we can calculate the average rating of the director's previous films and initialize the parameters of the beta distribution accordingly. Even if we know nothing about the director (producer, screenwriter, cast), we can use the average rating of all films in our database.
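A prior built from an average rating might look like the sketch below. The `strength` parameter, the number of "virtual votes" the prior is worth, is an assumption introduced here for illustration; the text does not say how strongly the prior should weigh.

```python
def prior_from_mean(mean, strength=2.0):
    """Beta parameters whose mean equals `mean` and whose sum equals `strength`.

    strength=2 with mean=0.5 reproduces the uniform prior (1, 1).
    """
    return mean * strength, (1.0 - mean) * strength


def posterior(prior, positive, negative):
    """Add observed votes on top of the prior parameters."""
    alpha, beta = prior
    return alpha + positive, beta + negative


# Hypothetical case: the director's earlier films average 0.65, trusted like 10 votes.
prior = prior_from_mean(0.65, strength=10)
print(prior, posterior(prior, positive=3, negative=1))
```

The larger `strength` is, the more real votes are needed before the film's own ratings dominate the prior.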

In principle, the method can be extended to finer-grained ratings, for example to a scale from 0 to 10. In that case a rating above 5 increments alpha, a rating below 5 increments beta, and a rating of exactly 5 increases both alpha and beta by 0.5 (hello, Habr!).
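The rule for a 0 to 10 scale translates directly into code. This is a sketch of exactly the thresholding described above; the function name `params_from_scores` is illustrative.

```python
def params_from_scores(scores, midpoint=5, prior=(1.0, 1.0)):
    """Accumulate beta parameters from scores on a 0..10 scale.

    Above the midpoint counts as a positive vote, below as a negative one,
    and a score exactly at the midpoint is split evenly between the two.
    """
    alpha, beta = prior
    for score in scores:
        if score > midpoint:
            alpha += 1.0
        elif score < midpoint:
            beta += 1.0
        else:
            alpha += 0.5
            beta += 0.5
    return alpha, beta


print(params_from_scores([8, 3, 5]))
```

This scheme discards how far a score is from the midpoint: a 6 and a 10 count the same.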

Finally, you can vary the required confidence level, or even change the approach altogether and use, instead of the minimum confidence quality, the area under the graph within some fixed interval.
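The interval variant is a difference of two CDF values. In the sketch below the interval [0.6, 0.8] is an arbitrary choice for illustration, and `mass_in_interval` is a hypothetical helper name.

```python
import scipy.stats as ss


def mass_in_interval(alpha, beta, low, high):
    """Probability that the quality of a B(alpha, beta) film lies between low and high."""
    dist = ss.beta(alpha, beta)
    return dist.cdf(high) - dist.cdf(low)


# Same mean of 0.7, very different amounts of evidence.
few_votes = mass_in_interval(7, 3, 0.6, 0.8)
many_votes = mass_in_interval(70, 30, 0.6, 0.8)
print(few_votes, many_votes)
```

The film with more votes concentrates far more of its probability mass inside the interval, so this measure also rewards confidence, not just the average.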
