An intuitive introduction to Markov chain Monte Carlo methods

Original author: Rahul Agarwal
  • Translation

Alexey Kuzmin, director of data development at DomKlik and a Data Science lecturer at Netology, translated an article by Rahul Agarwal on how Markov chain Monte Carlo methods are used to solve problems with a large state space.

Everyone involved in Data Science has heard of Markov chain Monte Carlo (MCMC) methods at some point. Sometimes the topic comes up when studying Bayesian statistics, sometimes when working with tools like Prophet.

But MCMC is hard to understand. Every time I read about these methods, I noticed that their essence is buried under layers of mathematical noise and is hard to make out behind that noise. I had to spend many hours getting to grips with this concept.

In this article I will try to explain Markov chain Monte Carlo methods so that it becomes clear what they are used for. I will cover a few more ways to use these methods in my next post.

So, let's begin. MCMC consists of two terms: Monte Carlo and Markov chains. Let's talk about each of them.

Monte Carlo




In the simplest terms, Monte Carlo methods can be defined as simple simulations.

Monte Carlo methods got their name from the Monte Carlo casino in Monaco. In many card games you need to know the probability of beating the dealer. Sometimes calculating this probability is mathematically complex or intractable. But we can always run a computer simulation that plays the whole game many times and treat the probability as the number of wins divided by the number of games played.
This is all you need to know about Monte Carlo methods. Yes, it's just a simple modeling technique with a fancy name.
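To make the idea concrete, here is a minimal sketch of such a simulation. It is my own toy example, not from the original article: we estimate the probability that two dice sum to at least 10 simply by playing the "game" many times.

import random

# Toy Monte Carlo simulation: estimate the probability that two dice sum to at least 10.
# The exact answer is 6/36, roughly 0.167, so the estimate should land near that.
def estimate_probability(n_games=100000):
    wins = 0
    for _ in range(n_games):
        if random.randint(1, 6) + random.randint(1, 6) >= 10:
            wins += 1
    return wins / n_games

print(estimate_probability())  # roughly 0.167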

Markov chains




Since the term MCMC consists of two parts, you still need to understand what Markov chains are. But before moving on to Markov chains, let's talk a little about the Markov property.

Suppose there is a system with M possible states, and you move from one state to another. Don't let anything confuse you yet. A concrete example of such a system is the weather, which changes from hot to cold to moderate. Another example is the stock market, which jumps between bearish, bullish, and stagnant states.

The Markov property says that for a process that is in state Xn at a given moment in time, the probability that Xn+1 = k (where k is any of the M states the process can move to) depends only on the current state, and not on how the process reached that state.
In mathematical language, we can write this as the following formula:

P(Xn+1 = k | Xn = kn, Xn-1 = kn-1, ..., X1 = k1) = P(Xn+1 = k | Xn = kn)
To make this concrete: you do not care about the sequence of states the market went through to become "bullish". The probability that the next state will be "bearish" is determined only by the fact that the market is currently in the "bullish" state. This also makes sense in practice.

A process with the Markov property is called a Markov process. Why is a Markov chain important? Because of its stationary distribution.

What is stationary distribution?


I will try to explain the stationary distribution by calculating it for the example below. Suppose you have a Markov process for the stock market, as shown below.

You have a transition probability matrix that determines the probability of a transition from state Xi to state Xj.

Matrix of transition probabilities, Q (rows are the current state, columns are the next state, in the order bull, bear, stagnation):

            bull    bear    stagnation
bull        0.900   0.075   0.025
bear        0.150   0.800   0.050
stagnation  0.250   0.250   0.500

In the given transition probability matrix Q, the probability that the next state will be "bull" given the current state "bull" is 0.9; the probability that the next state will be "bear" given the current state "bull" is 0.075. And so on.

Well, let's start from some particular state. Our state is given by the vector [bull, bear, stagnation]. If we start in the "bearish" state, the vector is [0, 1, 0]. We can calculate the probability distribution of the next state by multiplying the current state vector by the transition probability matrix:

[0, 1, 0] · Q = [0.15, 0.8, 0.05]

Note that the probabilities sum to 1.

The distribution of states after that can be found by the formula:

sn+1 = sn · Q
And so on. Eventually you will reach a stationary state in which the distribution no longer changes:

s · Q = s
For the transition probability matrix Q described above, the stationary distribution s is:

s = [0.625, 0.3125, 0.0625]
You can obtain the stationary distribution using the following code:

import numpy as np

Q = np.matrix([[0.9, 0.075, 0.025],
               [0.15, 0.8, 0.05],
               [0.25, 0.25, 0.5]])
init_s = np.matrix([[0, 1, 0]])    # start in the "bearish" state
epsilon = 1
while epsilon > 10e-9:             # iterate until the distribution stops changing
    next_s = np.dot(init_s, Q)
    epsilon = np.sqrt(np.sum(np.square(next_s - init_s)))
    init_s = next_s
print(init_s)
------------------------------------------------------------------
matrix([[0.62499998, 0.31250002, 0.0625    ]])

You can also start from any other state and arrive at the same stationary distribution. Change the initial state in the code if you want to check this.
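As a side note, the same stationary distribution can also be obtained without iterating: since s · Q = s, it is a left eigenvector of Q with eigenvalue 1. A small sketch of this alternative (my own addition, not from the original article):

import numpy as np

Q = np.array([[0.9, 0.075, 0.025],
              [0.15, 0.8, 0.05],
              [0.25, 0.25, 0.5]])

# A left eigenvector of Q is a right eigenvector of Q transposed.
eigvals, eigvecs = np.linalg.eig(Q.T)
idx = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
s = np.real(eigvecs[:, idx])
s = s / s.sum()                          # normalize so the probabilities sum to 1
print(s)                                 # approximately [0.625, 0.3125, 0.0625]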

Now we can answer the question of why the stationary distribution is so important.

The stationary distribution is important because it tells you the probability of finding the system in a given state at a randomly chosen moment in time.

For our example, we can say that 62.5% of the time the market will be in a "bullish" state, 31.25% of the time in a "bearish" state, and 6.25% of the time in stagnation.

Intuitively, you can see this as a random walk along the chain.


Random walk

You are at some node and choose the next state according to the probability distribution of the next state given the current state. Some nodes are visited more often than others, depending on their probabilities.
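Here is a minimal simulation of such a random walk on the market chain (my own sketch, not from the original article): the fraction of time spent in each state approaches the stationary distribution.

import random

# Transition probabilities of the market chain; rows: bull, bear, stagnation.
Q = [[0.9, 0.075, 0.025],
     [0.15, 0.8, 0.05],
     [0.25, 0.25, 0.5]]

def random_walk(n_steps=100000):
    counts = [0, 0, 0]
    state = 1                                    # start in the "bearish" state
    for _ in range(n_steps):
        # choose the next state according to the row of Q for the current state
        state = random.choices([0, 1, 2], weights=Q[state])[0]
        counts[state] += 1
    return [c / n_steps for c in counts]

print(random_walk())  # roughly [0.625, 0.3125, 0.0625]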

This is how Google solved the search problem at the dawn of the Internet. The problem was to rank pages by their importance. Google solved it with the PageRank algorithm. In PageRank each state is a page, and a page's probability in the stationary distribution is its relative importance.

Now let's move on to MCMC methods themselves.

What are Markov chain Monte Carlo (MCMC) methods?


Before answering what MCMC is, let me ask a question. We know the beta distribution and its probability density function. But can we draw a sample from this distribution? Can you come up with a way to do it?


Think ...

MCMC gives you a way to sample from any probability distribution. This is especially important when you need to sample from a posterior distribution.

Recall Bayes' theorem:

p(H | D) = p(D | H) · p(H) / p(D)

Suppose you need to sample from the posterior distribution. Is it easy to compute the posterior together with the normalizing constant (the evidence)? The numerator is just the product of the likelihood and the prior probability. But computing the normalizing constant p(D) does not work out so easily. Why? Let's take a closer look.

Suppose that H takes only 3 values:

p(D) = p(H = H1) · p(D | H = H1) + p(H = H2) · p(D | H = H2) + p(H = H3) · p(D | H = H3)
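In code this is just a weighted sum; the numbers below are hypothetical, purely for illustration and not from the article:

# Hypothetical prior and likelihood values for the three hypotheses.
prior = [0.5, 0.3, 0.2]           # p(H = H1), p(H = H2), p(H = H3)
likelihood = [0.1, 0.4, 0.7]      # p(D | H = H1), p(D | H = H2), p(D | H = H3)
p_D = sum(p * l for p, l in zip(prior, likelihood))
print(p_D)                        # 0.05 + 0.12 + 0.14 = 0.31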

In this case, p(D) is easy to calculate. But what if H is continuous? Could we compute it as easily, especially when H takes infinitely many values? We would have to evaluate a complicated integral: p(D) = ∫ p(H) · p(D | H) dH.

We want to sample from the posterior distribution, but we would like to treat p(D) simply as an unknown constant.

Wikipedia says:

Markov chain Monte Carlo methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its stationary distribution. The state of the chain after a number of steps is then used as a sample from the desired distribution. The quality of the sample improves as the number of steps grows.

Let's look at an example. Say you need a sample from the beta distribution. Its density is:

f(x) = C · x^(α − 1) · (1 − x)^(β − 1), for x in [0, 1],

where C is the normalizing constant. It is actually some function of α and β, but I want to show that it is not needed to sample from the beta distribution, so we will treat it simply as a constant.

Sampling from the beta distribution directly is a genuinely hard problem, if not practically insoluble. In reality you may have to work with even more complex distribution functions, and sometimes you will not know the normalizing constants.

MCMC methods make life easier by providing algorithms that construct a Markov chain whose stationary distribution is the beta distribution, given only that we can sample from a uniform distribution (which is relatively easy).

If we start from a random state and repeatedly move to the next state according to such an algorithm, we will eventually build a Markov chain whose stationary distribution is the beta distribution. The states we end up visiting after a long time can then be used as samples from the beta distribution.

One of these MCMC algorithms is the Metropolis-Hastings algorithm.

Metropolis-Hastings Algorithm




Intuition:


So what is the purpose?

Intuitively, we want to wander over some surface (our Markov chain) in such a way that the amount of time we spend at each location is proportional to the height of the surface at that location (the desired probability density we want to sample from).

So, for example, we would like to spend twice as much time on a hilltop 100 meters high as on a neighboring 50-meter hill. The nice thing is that we can do this even if we don't know the absolute heights of points on the surface: all we need to know are the relative heights. For example, if hilltop A is twice as high as hilltop B, we would like to spend twice as much time at A as at B.

There are more complex schemes for proposing new locations and rules for accepting them, but the basic idea is as follows:

  1. Choose a new "proposed" location.
  2. Find out how much higher or lower that location is compared to the current one.
  3. Stay in place or move to the new location with a probability proportional to the heights of the two locations.

The purpose of MCMC is to sample from some probability distribution without having to know its exact height at any point (there is no need to know C).
If the "wandering" process is set up correctly, you can make sure that this proportionality (between the time spent and the height of the distribution) is achieved.

Algorithm:


Now let's define and describe the task more formally. Let s = (s1, s2, ..., sM) be the desired stationary distribution. We want to create a Markov chain with that stationary distribution. We start with an arbitrary Markov chain on M states with transition matrix P, where pij is the probability of moving from state i to state j.

Intuitively, we know how to wander around this Markov chain, but it does not have the required stationary distribution. It has some stationary distribution, just not the one we need. Our goal is to change the way we wander around the chain so that it has the desired stationary distribution.

To do this:

  1. Start in a random initial state i.
  2. Randomly pick a new proposed state by looking at the transition probabilities in the i-th row of the transition matrix P.
  3. Compute a measure called the acceptance probability, defined as: aij = min(sj · pji / (si · pij), 1).
  4. Now flip a coin that lands heads with probability aij. If it comes up heads, accept the proposal, that is, move to the proposed state; otherwise reject the proposal and stay in the current state.
  5. Repeat for a long time.

After a large number of steps, this chain converges and has stationary distribution s. We can then use the states of the chain as samples from the desired distribution.
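Here is a minimal sketch of these steps for a small discrete chain (my own example, not from the original article). The target s is the market distribution from earlier, and the proposal picks any of the three states uniformly, so pij = pji and the acceptance probability reduces to min(sj / si, 1):

import random

s = [0.625, 0.3125, 0.0625]          # desired stationary distribution: bull, bear, stagnation

def metropolis_discrete(n_steps=100000):
    counts = [0, 0, 0]
    i = random.randrange(3)          # 1. start in a random initial state
    for _ in range(n_steps):
        j = random.randrange(3)      # 2. propose a new state uniformly at random
        a_ij = min(s[j] / s[i], 1)   # 3. acceptance probability (pij = pji cancels out)
        if random.uniform(0, 1) < a_ij:
            i = j                    # 4. accept: move to the proposed state
        counts[i] += 1               #    otherwise stay where we are
    return [c / n_steps for c in counts]

print(metropolis_discrete())  # approaches [0.625, 0.3125, 0.0625]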

When we do this to sample the beta distribution, the only place the probability density is used is in computing the acceptance probability, and there sj is divided by si, so the normalizing constant C cancels out.

Sampling the beta distribution




Now we turn to the problem of sampling from the beta distribution.

The beta distribution is a continuous distribution on [0, 1], so it has infinitely many possible values on [0, 1]. Suppose we take an arbitrary Markov chain with infinitely many states on [0, 1] whose transition matrix P has all entries equal, so that pij = pji.

We will not actually need the matrix P, as we will see later, but I want the description of the problem to stay as close as possible to the algorithm we proposed.

  • Start in a random initial state i drawn from the uniform distribution on (0, 1).
  • Randomly pick a new proposed state by looking at the transition probabilities in the i-th row of the transition matrix P. In our case this simply means drawing another value from Unif(0, 1) and taking it as the proposed state j.
  • Compute the acceptance probability:

    aij = min(sj · pji / (si · pij), 1)

    Since pji = pij, this simplifies to:

    aij = min(sj / si, 1)

    where sw = C · w^(α − 1) · (1 − w)^(β − 1) is the (unnormalized) beta density at state w, so the constant C cancels.

  • Now flip a coin that lands heads with probability aij. If it comes up heads, accept the proposal, that is, move to the proposed state. Otherwise reject the proposal and stay in the current state.
  • Repeat for a long time.

The code:


It's time to move from theory to practice. We will write our beta sampler in Python.

import random

# Unnormalized beta density: gives s for any particular state w.
# We don't care about the normalizing constant here.
def beta_s(w, a, b):
    return w**(a - 1) * (1 - w)**(b - 1)

# Returns True if a coin with probability p of heads comes up heads when flipped.
def random_coin(p):
    unif = random.uniform(0, 1)
    if unif >= p:
        return False
    else:
        return True

# Runs the MCMC chain for the beta distribution.
def beta_mcmc(N_hops, a, b):
    states = []
    cur = random.uniform(0, 1)
    for i in range(0, N_hops):
        states.append(cur)
        next_state = random.uniform(0, 1)  # propose a new state
        ap = min(beta_s(next_state, a, b) / beta_s(cur, a, b), 1)  # acceptance probability
        if random_coin(ap):
            cur = next_state
    return states[-1000:]  # return the last 1000 states of the chain
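A quick sanity check of the sampler (my own addition, not from the original article): the mean of draws from Beta(2, 3) should be close to the theoretical mean a / (a + b) = 0.4.

samples = beta_mcmc(100000, 2, 3)
print(sum(samples) / len(samples))  # roughly 0.4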

Compare the results with the actual beta distribution. 

import numpy as np
import pylab as pl
import scipy.special as ss
%matplotlib inline
pl.rcParams['figure.figsize'] = (17.0, 4.0)

# Actual beta PDF.
def beta(a, b, i):
    e1 = ss.gamma(a + b)
    e2 = ss.gamma(a)
    e3 = ss.gamma(b)
    e4 = i ** (a - 1)
    e5 = (1 - i) ** (b - 1)
    return (e1 / (e2 * e3)) * e4 * e5

# Plot the actual beta PDF together with the sample drawn from the MCMC chain.
def plot_beta(a, b):
    Ly = []
    Lx = []
    i_list = np.mgrid[0:1:100j]
    for i in i_list:
        Lx.append(i)
        Ly.append(beta(a, b, i))
    pl.plot(Lx, Ly, label="Real Distribution: a=" + str(a) + ", b=" + str(b))
    pl.hist(beta_mcmc(100000, a, b), density=True, bins=25, histtype='step',
            label="Simulated_MCMC: a=" + str(a) + ", b=" + str(b))
    pl.legend()
    pl.show()

plot_beta(0.1, 0.1)
plot_beta(1, 1)
plot_beta(2, 3)




As you can see, the sampled values closely match the beta distribution: the MCMC chain has reached its stationary distribution.

In the code above we built a beta sampler, but the same approach applies to any other distribution we want to sample from.

Conclusions




This was a long post. Congratulations if you read it to the end.

In essence, MCMC methods can be complex, but they give us great flexibility: you can sample from any distribution function with MCMC sampling. Typically these methods are used to sample from posterior distributions.

You can also use MCMC to solve problems with a large state space, for example the knapsack problem or decryption. I will try to give you more interesting examples in the next post. Stay tuned.
