The book "Probabilistic programming in Python: Bayesian inference and algorithms"

    Hi, Habr readers! Bayesian methods scare off many IT specialists with their formulas, but statistics and probability analysis are now indispensable. Cameron Davidson-Pilon explains the Bayesian method from the point of view of a practicing programmer, using the PyMC library together with NumPy, SciPy, and Matplotlib. By showing the role of Bayesian inference in A/B testing, fraud detection, and other pressing tasks, the book helps you not only understand this non-trivial topic but also start applying what you learn to your own goals.

    Excerpt: 4.3.3. Example: sorting comments on Reddit

    Perhaps you do not agree that everyone applies the law of large numbers, if only implicitly, in subconscious decision making. Consider online product ratings. How much do you trust an average rating of five points based on one review? Two reviews? Three? You subconsciously understand that with so few reviews, the average rating is a poor reflection of how good or bad the product really is.

    This leads to flaws in how items are sorted and, more generally, compared. Many shoppers realize that sorting online search results by rating is not very objective, whether the items are books, videos, or online comments. Often the movies or comments in the top spots earn their high marks only from a handful of enthusiastic fans, while genuinely good films or comments are buried on later pages with supposedly imperfect ratings of about 4.8. What can be done about it?

    Consider the popular site Reddit (I deliberately do not link to it, because Reddit is notoriously addictive and I am afraid you would never return to my book). The site hosts a great many links to stories and pictures, and the comments on those links are very popular too. Users of the site (usually called redditors) can vote for or against each comment (so-called upvotes and downvotes). By default, Reddit sorts comments in descending order, best first. How can we determine which comments are the best? A few metrics are commonly considered.

    1. Popularity. A comment is considered good if it has many votes. This model breaks down for a comment with hundreds of upvotes and thousands of downvotes: although very popular, such a comment seems too controversial to be considered the "best."

    2. Difference. We can use the difference between the number of upvotes and the number of downvotes. This fixes the problem with the popularity metric but ignores the temporal nature of comments. Comments may be posted many hours after the original link was published. This introduces a bias: the top rating goes not to the best comments but to the oldest ones, which have had time to accumulate more upvotes than newer ones.

    3. Time adjustment. Consider dividing the difference between upvotes and downvotes by the age of the comment to obtain a rate, such as the difference per second or per minute. A counterexample comes to mind immediately: with the per-second version, a comment posted one second ago with one upvote beats a comment posted 100 seconds ago with 99 upvotes. We can avoid this by considering only comments that are at least t seconds old. But how do we choose a good value of t? Does it mean that every comment posted less than t seconds ago is bad? We end up comparing unstable quantities with stable ones (new comments versus old ones).

    4. Ratio. Rank comments by the ratio of upvotes to the total number of votes (upvotes plus downvotes). This removes the temporal problem: recently posted comments with good scores are as likely to rank highly as older ones, provided they have a relatively high ratio of upvotes to total votes. The problem with this method is that a comment with a single upvote (ratio = 1.0) outranks a comment with 999 upvotes and one downvote (ratio = 0.999), although the second comment is clearly more likely to be the better one.

    I wrote “likely” for a reason. It may turn out that the first comment, with its single upvote, really is better than the second, with 999 upvotes. We hesitate to accept this only because we have not seen the other 999 potential votes the first comment might receive. It could, say, go on to collect another 999 upvotes and no downvotes and end up better than the second comment, although such a scenario is not very likely.
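The four naive metrics above can be compared side by side. The sketch below is not from the book: the comment labels, vote counts, and ages are hypothetical values chosen to reproduce the failure cases described in the list, and the function names are illustrative.

```python
# Hypothetical comments: (label, upvotes, downvotes, age in seconds).
comments = [
    ("controversial", 300, 2000, 7200),
    ("old favourite", 99, 1, 100),
    ("brand new", 1, 0, 1),
    ("well tested", 999, 1, 3600),
]

def popularity(up, down, age):
    # Total number of votes, regardless of direction.
    return up + down

def difference(up, down, age):
    # Upvotes minus downvotes.
    return up - down

def time_adjusted(up, down, age):
    # Vote difference per second of the comment's age.
    return (up - down) / float(age)

def ratio(up, down, age):
    # Observed share of upvotes among all votes.
    return up / float(up + down)

# For each metric, record which comment it would put in first place.
winners = {}
for metric in (popularity, difference, time_adjusted, ratio):
    best = max(comments, key=lambda c: metric(*c[1:]))
    winners[metric.__name__] = best[0]
    print("%-13s ranks first: %s" % (metric.__name__, best[0]))
```

With these made-up numbers, popularity promotes the heavily downvoted comment, while both the time-adjusted rate and the raw ratio promote the comment that has a single vote.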

    What we really want is an estimate of the true upvote ratio. Note that this is not the same as the observed upvote ratio: the true ratio is hidden, and we observe only the counts of upvotes and downvotes (the true upvote ratio can be thought of as the probability that the comment receives an upvote rather than a downvote). Thanks to the law of large numbers, we can say with confidence that for a comment with 999 upvotes and one downvote the true upvote ratio is likely to be close to 1. On the other hand, we are much less certain about the true upvote ratio of the comment with a single upvote. This looks like a Bayesian problem.
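The gap between the true and the observed upvote ratio can be illustrated with a small simulation. This is a sketch under stated assumptions, not the book's code: the true ratio of 0.9, the vote counts, and the helper name `observed_ratio` are all illustrative.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def observed_ratio(true_ratio, n_votes):
    """Simulate n_votes independent votes and return the observed upvote ratio."""
    upvotes = sum(1 for _ in range(n_votes) if random.random() < true_ratio)
    return upvotes / float(n_votes)

true_ratio = 0.9  # assumed hidden probability of an upvote (illustrative)
spreads = {}
for n_votes in (1, 10, 1000):
    # Repeat the experiment 500 times and see how far the observed ratio wanders.
    ratios = [observed_ratio(true_ratio, n_votes) for _ in range(500)]
    spreads[n_votes] = max(ratios) - min(ratios)
    print("N=%4d  observed ratio ranges from %.3f to %.3f"
          % (n_votes, min(ratios), max(ratios)))
```

The more votes a comment has, the narrower the range of observed ratios around the true value, which is why a single upvote says so little about the comment.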

    One way to determine a prior distribution for the upvote ratio is to study the historical distribution of upvote ratios. This could be done by scraping Reddit comments and fitting a distribution. However, this approach has several drawbacks.

    1. Skewed data. The vast majority of comments have very few votes, so many comments will have ratios near the extremes (see the “triangular” plot in the Kaggle dataset example in Fig. 4.4), and the distribution will be heavily skewed. We could restrict ourselves to comments whose vote count exceeds some threshold, but that raises difficulties of its own: we must balance the number of comments available against a higher threshold and the more precise ratios it brings.

    2. Biased data (data containing systematic error). Reddit is made up of many sub-forums (subreddits). Two examples are r/aww, with pictures of cute animals, and r/politics. User behavior when commenting in these two subreddits is very likely to differ radically: in the first, visitors tend to be charmed and friendly, which leads to more upvotes, whereas in the second, opinions in the comments are likely to diverge.

    In light of this, I think it makes sense to use a uniform prior distribution.

    Now we can compute the posterior distribution of the true upvote ratio. A script is used to scrape the comments on the current top-ranked Reddit image. In the following code, we scrape the Reddit comments for one such image [3]:

    from IPython.core.display import Image
    # Adding a number to the %run call
    # fetches the i-th top-ranked picture.
    %run 2

    Title of submission:
    Frozen mining truck

    import numpy as np

    # contents: an array of the text of every comment on the picture
    # votes: a 2D NumPy array of upvotes and downvotes for each comment
    n_comments = len(contents)
    # Pick four comments at random.
    comments = np.random.randint(n_comments, size=4)
    print("Some comments (out of %d total)" % n_comments)
    for i in comments:
        print('"' + contents[i] + '"')
        print("upvotes/downvotes: %s" % votes[i, :])

    Some comments (out of 77 total)
    "Do these trucks remind anyone else of Sly Cooper?"
    upvotes/downvotes: [2 0]
    "Dammit Elsa I told you not to drink and drive."
    upvotes/downvotes: [7 0]
    "I've seen this picture before in a Duratray (the dump box supplier) brochure..."
    upvotes/downvotes: [2 0]
    "Actually it does not look frozen just covered in a layer of wind packed snow."
    upvotes/downvotes: [120 18]

    Given N votes and a true upvote ratio p, the number of upvotes looks like a binomial random variable with parameters p and N (the true upvote ratio is equivalent to the probability of casting an upvote rather than a downvote in each of N possible votes, or trials). We create a function that performs Bayesian inference on p for the upvote/downvote counts of a particular comment.

    import pymc as pm  # PyMC 2 API

    def posterior_upvote_ratio(upvotes, downvotes, samples=20000):
        """
        This function accepts the number of upvotes and downvotes
        a particular comment received, and the number of posterior
        samples to return to the user.
        Assumes a uniform prior.
        """
        N = upvotes + downvotes
        upvote_ratio = pm.Uniform("upvote_ratio", 0, 1)
        observations = pm.Binomial("obs", N, upvote_ratio,
                                   value=upvotes, observed=True)
        # Fitting: run MAP first, since it is computationally
        # cheap and useful.
        map_ = pm.MAP([upvote_ratio, observations]).fit()
        mcmc = pm.MCMC([upvote_ratio, observations])
        # Discard the first quarter of the samples as burn-in.
        mcmc.sample(samples, samples // 4)
        return mcmc.trace("upvote_ratio")[:]
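
As a cross-check that is not part of the book's excerpt: with a uniform prior and a binomial likelihood, the posterior of the true upvote ratio is known in closed form as a Beta(1 + upvotes, 1 + downvotes) distribution (a standard conjugacy result). The sketch below samples from it directly with the standard library instead of running MCMC; the helper name `beta_posterior_samples` and the `seed` parameter are illustrative assumptions.

```python
import random

def beta_posterior_samples(upvotes, downvotes, samples=20000, seed=0):
    """Draw samples from Beta(1 + upvotes, 1 + downvotes).

    Under a uniform prior and a binomial likelihood this is the exact
    posterior of the true upvote ratio, so no MCMC is needed.
    """
    rng = random.Random(seed)  # private generator, repeatable draws
    return [rng.betavariate(1 + upvotes, 1 + downvotes)
            for _ in range(samples)]

# The comment with 120 upvotes and 18 downvotes from the output above.
draws = beta_posterior_samples(120, 18)
posterior_mean = sum(draws) / len(draws)
# The analytic posterior mean is (1 + 120) / (2 + 120 + 18) = 121 / 140.
print("posterior mean: %.3f" % posterior_mean)
```

Samples drawn this way should be distributed like the trace returned by `posterior_upvote_ratio`, up to Monte Carlo noise, which makes the shortcut handy for checking the MCMC result.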

    The following are the resulting posterior distributions.

    import matplotlib.pyplot as plt
    from IPython.core.pylabtools import figsize

    figsize(11., 8)
    posteriors = []
    colors = ["#348ABD", "#A60628", "#7A68A6", "#467821", "#CF4457"]
    for i in range(len(comments)):
        j = comments[i]
        # Legend label: vote counts plus the first 50 characters of the comment.
        label = u'(%d up:%d down)\n%s...' % (votes[j, 0], votes[j, 1],
                                             contents[j][:50])
        posteriors.append(posterior_upvote_ratio(votes[j, 0], votes[j, 1]))
        # Draw each posterior twice: a solid outline and a translucent fill.
        # normed=True is the older Matplotlib spelling; newer releases
        # use density=True instead.
        plt.hist(posteriors[i], bins=18, normed=True, alpha=.9,
                 histtype="step", color=colors[i % 5], lw=3, label=label)
        plt.hist(posteriors[i], bins=18, normed=True, alpha=.2,
                 histtype="stepfilled", color=colors[i % 5], lw=3)
    plt.legend(loc="upper left")
    plt.xlim(0, 1)
    plt.xlabel(u"Probability of an upvote")
    plt.title(u"Posterior distributions of upvote ratios "
              u"for different comments");

    [****************100%******************] 20000 of 20000 complete

    As Fig. 4.5 shows, some distributions are tightly “squeezed,” while others have relatively long “tails,” expressing our uncertainty about what the true upvote ratio is.


    4.3.4. Sorting

    So far we have ignored the main goal of our example: sorting comments from best to worst. Of course, distributions cannot be sorted; sorting requires scalar values. There are many ways to distill a distribution into a scalar; for example, a distribution can be summarized by its expected value, or mean. The mean is not suitable here, however, because it does not take into account the uncertainty of the distributions.

    I recommend using the 95% least plausible value, defined as the value such that there is only a 5% probability that the true value of the parameter lies below it (compare the lower bound of a Bayesian credible interval). Next, we plot the posterior distributions with the 95% least plausible value marked (Fig. 4.6).


    N = posteriors[0].shape[0]
    lower_limits = []
    for i in range(len(comments)):
        j = comments[i]
        # Legend label: vote counts plus the first 50 characters of the comment.
        label = u'(%d up:%d down)\n%s...' % (votes[j, 0], votes[j, 1],
                                             contents[j][:50])
        plt.hist(posteriors[i], bins=20, normed=True, alpha=.9,
                 histtype="step", color=colors[i], lw=3, label=label)
        plt.hist(posteriors[i], bins=20, normed=True, alpha=.2,
                 histtype="stepfilled", color=colors[i], lw=3)
        # 95% least plausible value: the 5th percentile of the posterior samples.
        v = np.sort(posteriors[i])[int(0.05 * N)]
        plt.vlines(v, 0, 10, color=colors[i], linestyles="--")
        lower_limits.append(v)
    plt.legend(loc="upper left")
    plt.xlabel(u"Probability of an upvote")
    plt.title(u"Posterior distributions of upvote ratios "
              u"for different comments");
    # Order the comments by descending lower limit.
    order = np.argsort(-np.array(lower_limits))
    print("%s %s" % (order, lower_limits))

    [3 1 2 0] [0.36980613417267094, 0.68407203257290061,
          0.37551825562169117, 0.8177566237850703]

    According to our procedure, the best comments are those with the highest probability of receiving a high percentage of upvotes. Visually, these are the comments whose 95% least plausible value is closest to 1. In Fig. 4.6 the 95% least plausible value is shown with vertical lines.

    Why is sorting by this metric such a good idea? Sorting by the 95% least plausible value means being as cautious as possible in declaring comments the best. That is, even in the worst case, when we have badly overestimated the upvote ratio, we can be confident that the best comments still end up on top. This ordering has the following very natural properties.

    1. Of two comments with the same observed upvote ratio, the one with more votes is ranked higher (since we are more confident that its ratio really is that high).

    2. Of two comments with the same number of votes, the one with more upvotes is ranked higher.
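
Both properties can be checked numerically. The sketch below is an illustration rather than the book's code: instead of the PyMC model it samples the Beta(1 + upvotes, 1 + downvotes) distribution, which is the exact posterior under a uniform prior and a binomial likelihood, and takes the 5% quantile of the draws. The vote counts and the helper name `least_plausible_value` are hypothetical.

```python
import random

def least_plausible_value(upvotes, downvotes, samples=20000, seed=0):
    """Approximate the 5% quantile of the Beta(1 + up, 1 + down) posterior."""
    rng = random.Random(seed)  # private generator, repeatable draws
    draws = sorted(rng.betavariate(1 + upvotes, 1 + downvotes)
                   for _ in range(samples))
    return draws[int(0.05 * samples)]

# Hypothetical (upvotes, downvotes) pairs:
#   (9, 1) and (90, 10) share the observed ratio 0.9,
#   (8, 2) and (6, 4) share the total of 10 votes,
#   (1, 0) and (999, 1) are the pair discussed under the ratio metric.
pairs = [(9, 1), (90, 10), (8, 2), (6, 4), (1, 0), (999, 1)]
lpv = {votes: least_plausible_value(*votes) for votes in pairs}

# Print the comments in the order this metric would rank them.
for votes in sorted(lpv, key=lpv.get, reverse=True):
    print("%4d up / %2d down -> %.3f" % (votes[0], votes[1], lpv[votes]))
```

Under these assumptions the comment with 999 upvotes and one downvote lands well above the comment with a single upvote, which is the ordering the raw ratio failed to produce.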

    » More information about the book can be found on the publisher’s website
    » Contents
    » Excerpt

    For Habr readers: 25% off with the coupon code JavaScript

    When you pay for the paper version of the book, an e-book is sent to you by e-mail.
