Understanding Q-learning, the Cliff Walking problem
In the last post, we presented the Cliff Walking problem and settled on a mysterious algorithm that did not seem to make any sense. This time we will reveal the secrets of that gray box and see that it is not so scary after all.
Summary
We concluded that by maximizing the sum of future rewards we also find the fastest path to the goal, so our task now is to find a way to do exactly that!
Introduction to Q-Learning
- Let's start by building a table that measures how good a certain action is in any given state (we can measure this with a simple scalar value, so the larger the value, the better the action).
- This table will have one row for each state and one column for each action. Our grid world has 48 states (4 along Y times 12 along X) and 4 allowed actions (up, down, left, right), so the table will be 48 x 4.
- The values stored in this table are called “Q-values.”
- These are estimates of the sum of future rewards. In other words, they estimate how much more reward we can collect before the end of the game if we are in state S and take action A.
- We initialize the table with random values (or some constant, for example all zeros); a short NumPy sketch of such a table follows this list.
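A minimal NumPy sketch of such a table (the grid size and action count are the ones used in this post; the variable names are only illustrative):

import numpy as np

num_states = 4 * 12   # 48 squares in the 4-by-12 grid
num_actions = 4       # one column per possible move

# One row per state, one column per action, initialized to zeros
q_values = np.zeros((num_states, num_actions))

# The estimated value of taking action a in state s is then q_values[s, a]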
The optimal "Q-table" holds values that let us take the best action in every state, which in the end gives us the best path to the goal. Problem solved, cheers, robot overlords :).
Q-values of the first five states.
Q-learning
- Q-learning is an algorithm that “learns” these values.
- At every step we get more information about the world.
- This information is used to update the values in the table.
For example, suppose we are one step away from the goal (square [2, 11]); if we decide to move down, we get a reward of 0 instead of -1.
We can use this information to update the value of the state-action pair in our table, and the next time we visit it, we will already know that moving down gives us a reward of 0.
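In table terms this is a single-cell update. A minimal sketch, assuming states are numbered row by row (so square [2, 11] is state 2 * 12 + 11 = 35) and that "down" sits at action index 1; both conventions are only illustrative:

import numpy as np

q_values = np.zeros((48, 4))   # stand-in for the table built above

# We observed that moving down from square [2, 11] gives a reward of 0
state = 2 * 12 + 11   # = 35, the square one step above the goal
down = 1              # assumed index of the "down" action

# The crudest possible update: just remember what we saw
q_values[state, down] = 0.0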
Now we can spread this information even further! Since we now know the path to the goal from square [2, 11], any action that brings us to square [2, 11] will also be good, so we update the Q-value of the state-action pair that leads us to [2, 11] to be closer to 0.
And that, ladies and gentlemen, is the essence of Q-learning!
Note that every time we reach the goal we extend our "map" of how to reach it by one more square, so after a sufficient number of iterations we will have a complete map that shows how to get to the goal from every state.
A path is generated by taking the best action in every state. The best action in each cell is highlighted in green, and more saturated colors represent higher values. The numbers show the value of each action (up, down, right, left).
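Once the table is filled in, a path like the one in the figure can be read straight out of it by taking the best action in every state. A minimal sketch, assuming a learned 48 x 4 table and the row-by-row state numbering used above:

import numpy as np

q_values = np.zeros((48, 4))               # stand-in for the learned table
actions = ["up", "down", "right", "left"]  # the order used in the figure

# For every state, pick the action with the highest Q-value
greedy_actions = np.argmax(q_values, axis=1)

# For example, the best action in square [2, 11] (state 35)
print(actions[greedy_actions[2 * 12 + 11]])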
Bellman equation
Before we talk about code, let's talk about the math behind the core concept of Q-learning: the Bellman equation.
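For action values in our deterministic grid, the equation reads:

Q(s, a) = r + γ · max_a' Q(s', a')

where s' is the state we land in after taking action a in state s, r is the reward received on that transition, and max_a' means taking the maximum over the actions available in s'.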
- First let's forget γ in this equation
- The equation states that the Q-value for a given state-action pair should be the reward received on the transition to the new state (by performing this action), plus the value of the best action in that next state.
In other words, we propagate information about the values of actions one step at a time!
But we may decide that receiving a reward right now is more valuable than receiving the same reward in the future, and that is why we have γ, a number from 0 to 1 (usually from 0.9 to 0.99) that multiplies the future reward, discounting future rewards.
So, taking γ = 0.9 and applying this to some squares of our grid world, we have:
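As an illustration, take four consecutive squares along the optimal path, counting backwards from the goal, and recall that every step gives a reward of -1 except the one that reaches the goal, which gives 0:

Q(1 step from the goal)  = 0
Q(2 steps from the goal) = -1 + 0.9 · 0      = -1
Q(3 steps from the goal) = -1 + 0.9 · (-1)   = -1.9
Q(4 steps from the goal) = -1 + 0.9 · (-1.9) = -2.71

Here Q(k steps from the goal) is shorthand for the value of the best action in that square.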
We can compare values like these with those in the GIF above and see that they match.
Implementation
Now that we have an intuitive understanding of how Q-learning works, we can start thinking about implementing all of this, and we will use the Q-learning pseudo-code from Sutton's book as a guide.
Pseudo-code from Sutton's book.
The code:
# Initialize Q arbitrarily, in this case a table full of zeros
q_values = np.zeros((num_states, num_actions))

# Iterate over 500 episodes
for _ in range(500):
    state = env.reset()
    done = False

    # While episode is not over
    while not done:
        # Choose action
        action = egreedy_policy(q_values, state, epsilon=0.1)
        # Do the action
        next_state, reward, done = env.step(action)
        # Update q_values
        td_target = reward + gamma * np.max(q_values[next_state])
        td_error = td_target - q_values[state][action]
        q_values[state][action] += learning_rate * td_error
        # Update state
        state = next_state
- First, we are told: "Initialize Q(s, a) arbitrarily, for all states and actions." This means we can create our table of Q-values with any values we like: they can be random, they can be some constant, it does not matter. We can see that in line 2 we create a table full of zeros.
We are also told that "the value of Q for terminal states is zero." We cannot take any action in a terminal state, so we consider the value of every action in such a state to be zero.
- For each episode we have to "initialize S". This is just a fancy way of saying "restart the game"; in our case it means putting the player back at the starting position. Our world has a method that does exactly that, and we call it on line 6.
- For each time step (every time we need to act), we must choose an action derived from Q.
Remember when I said that we take the most valuable action in every state?
When we do this, we are using our Q-values to create a policy; in this case it is a greedy policy, because we always take the action that we believe is best in every state, so we say that we are acting greedily.
The problem
But there is a problem with this approach: imagine we are in a maze with two rewards, one worth +1 and the other worth +100 (and the game ends whenever we find either of them). Since we always take the action we currently consider the best, we will get stuck with the first reward we find and keep returning to it, so if we find the +1 reward first, we will miss out on the big +100 reward.
The solution
We need to make sure we explore our world sufficiently (this is a surprisingly difficult task). This is where ε comes into play: ε in ε-greedy means that we act greedily, BUT take a random action a fraction ε of the time, so that with an infinite number of tries we eventually visit every state.
The action is selected according to this policy in line 12, with epsilon=0.1, which means we explore the world 10% of the time. The implementation of the policy looks like this:
def egreedy_policy(q_values, state, epsilon=0.1):
    # Get a random number from a uniform distribution between 0 and 1,
    # if the number is lower than epsilon choose a random action
    if np.random.random() < epsilon:
        return np.random.choice(4)
    # Else choose the action with the highest value
    else:
        return np.argmax(q_values[state])
In line 14 of the first listing we call the step method to perform the action; the world returns the next state, the reward, and a flag telling us whether the episode is over.
Back to the math:
We have a long equation, let's think about it:
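Spelled out with the same symbols as before, plus the learning rate α, the tabular Q-learning update is:

Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') - Q(s, a)]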
If we take α = 1:
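the old value cancels out and the update becomes simply

Q(s, a) ← r + γ · max_a' Q(s', a')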
This exactly matches the Bellman equation that we saw a few paragraphs ago! So we already know that this is the line responsible for propagating information about action values.
But α (better known as the learning rate) is usually much smaller than 1; its main purpose is to avoid large changes in a single update, so instead of flying straight to the target value we approach it slowly. In our tabular approach, setting α = 1 causes no problems, but when working with neural networks (more on that in future posts) things can easily get out of control.
Looking at the code, we see that in line 16 of the first listing we define td_target, the value we should move toward; but instead of jumping straight to that value, in line 17 we compute td_error, and we use it together with the learning rate to move toward the target slowly.
Remember that this equation is the essence of Q-learning.
Now we just need to update our state, and we are done; this is line 20. We repeat this process until we reach the end of the episode, either by falling off the cliff or by reaching the goal.
Conclusion
Now we understand Q-learning intuitively and know how to implement it (at least the tabular version). Be sure to check out all the code used for this post, available on GitHub.
Visualization of the learning process during testing:
Note that every action starts with a value higher than its final value; this is a trick to encourage exploration of the world.
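In this grid part of that optimism comes for free: the table starts at zero while every achievable return is zero or negative, so untried actions never look worse than tried ones. The same idea, known as optimistic initialization, can also be applied explicitly. A minimal sketch, assuming the same table shape as before:

import numpy as np

num_states, num_actions = 48, 4

# Start every Q-value above anything the agent can actually achieve,
# so unexplored actions keep looking attractive until they are tried
q_values = np.full((num_states, num_actions), 10.0)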