Mountain Car: solving the classic problem with reinforcement learning

    As a rule, modifications of algorithms that rely on the specific features of a particular task are considered less valuable, since they are hard to generalize to a wider class of problems. However, this does not mean that such modifications are useless: they can often significantly improve the result even for simple classical problems, which matters a great deal when algorithms are applied in practice. As an example, in this post I will solve the Mountain Car problem with reinforcement learning and show that, using knowledge of how the task is structured, it can be solved much faster.



    About myself


    My name is Oleg Svidchenko. I am currently a student at the School of Physics, Mathematics and Computer Science of the St. Petersburg HSE; before that I studied at St. Petersburg University for three years. I also work as a researcher at JetBrains Research. Before entering university, I studied at the SSC of Moscow State University and, as part of the Moscow team, won the All-Russian Olympiad in Informatics for school students.

    What do we need?


    If you want to try reinforcement learning, the Mountain Car problem is a great place to start. For this post you will need Python with the Gym and PyTorch libraries installed, as well as basic knowledge of neural networks.

    Task description


    In a two-dimensional world, a car must climb out of the hollow between two hills to the top of the right hill. The difficulty is that the car's engine is not powerful enough to overcome gravity and reach the top on the first attempt. We are asked to train an agent (in our case, a neural network) that, by controlling the car, can climb the right hill as quickly as possible.

    The car is controlled through interaction with the environment. The interaction is divided into independent episodes, and each episode proceeds step by step. At each step, the agent receives a state s and a reward r from the environment in response to its action a. In addition, the environment may report that the episode has ended. In this problem, s is a pair of numbers: the first is the position of the car along the curve (one coordinate is enough, since the car cannot leave the surface), and the second is its velocity along the surface (with a sign). The reward r in this task is always equal to -1; in this way we encourage the agent to finish the episode as quickly as possible. There are only three possible actions: push the car to the left, do nothing, and push the car to the right; they correspond to the numbers 0 to 2. The episode ends when the car reaches the top of the right hill or when the agent has taken 200 steps.

    A bit of theory


    There was already an article about DQN on Habr in which the author described all the necessary theory quite well. Nevertheless, for ease of reading, I will repeat it here in a more formal form.

    A reinforcement learning task is defined by a state space S, an action space A, a discount factor $\gamma$, a transition function T, and a reward function R. In general, the transition function and the reward function can be random variables, but here we consider a simpler version in which they are deterministic. The goal is to maximize the cumulative reward $\sum_{t=0}^{T} r_{t} \cdot \gamma^{t}$, where t is the step number within the environment and T is the number of steps in the episode.

    To solve this problem, we define the value function V of a state s as the maximum cumulative reward attainable starting from state s. Knowing such a function, we could solve the problem simply by moving at each step to the state with the maximum possible value. However, it is not that simple: in most cases we do not know which action will take us to the desired state. Therefore, we add the action a as a second argument of the function. The resulting function is called the Q-function; it shows the maximum possible cumulative reward we can get by performing action a in state s. With this function we can already solve the problem: being in state s, we simply choose the a for which Q(s, a) is maximal.
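    In symbols (a standard formulation for the deterministic case described above, written out here for convenience rather than taken from the original text), the two functions are related by the Bellman optimality equations: $V(s) = \max_{a} Q(s, a)$ and $Q(s, a) = R(s, a) + \gamma \cdot \max_{a'} Q(T(s, a), a')$. It is the second relation that the DQN update below approximates.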

    In practice, we do not know the real Q-function, but we can approximate it in various ways. One such method is the Deep Q-Network (DQN). Its idea is to approximate the Q-function with a neural network that outputs one value per action.

    Environment


    Now let's get to practice. First, we need to learn how to emulate the MountainCar environment. The Gym library, which provides a large number of standard reinforcement learning environments, will help us with this task. To create an environment, we call the gym.make function, passing it the name of the desired environment as a parameter:
    import gym
    env = gym.make("MountainCar-v0")
    

    Detailed documentation can be found here, and a description of the environment here.
    Let us take a closer look at what we can do with the environment we have created:

    • env.reset() - ends the current episode and starts a new one. Returns the initial state.
    • env.step(action) - performs the specified action. Returns the new state, the reward, a flag indicating whether the episode has ended, and additional information that can be used for debugging.
    • env.seed(seed) - sets the random seed, which determines how initial states are generated in env.reset().
    • env.render() - displays the current state of the environment.
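    For example, a minimal interaction loop with random actions (a quick sketch for getting used to this API; it is not part of the original code) could look like this:

    import gym

    env = gym.make("MountainCar-v0")
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = env.action_space.sample()  # random action: 0, 1 or 2
        state, reward, done, _ = env.step(action)
        total_reward += reward
    print(total_reward)  # almost certainly -200.0: random pushes do not reach the top
    env.close()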

    Implementing DQN


    DQN is an algorithm that uses a neural network to estimate the Q-function. In the original paper, DeepMind defined a standard architecture for Atari games using convolutional neural networks. Unlike those games, Mountain Car does not use an image as its state, so we will have to define the architecture ourselves.

    Take, for example, an architecture with two hidden layers of 32 neurons each. After each hidden layer we use ReLU as the activation function. The two numbers describing the state are fed to the input of the neural network, and at the output we get estimates of the Q-function, one per action.

    Neural network architecture

    import copy
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(2, 32),
        nn.ReLU(),
        nn.Linear(32, 32),
        nn.ReLU(),
        nn.Linear(32, 3)
    )
    target_model = copy.deepcopy(model)
    # Initialize the weights of the neural network
    def init_weights(layer):
        if type(layer) == nn.Linear:
            nn.init.xavier_normal_(layer.weight)
    model.apply(init_weights)
    

    Since we will train the neural network on a GPU, we need to move the network there:

    # If we want to train on the CPU, replace "cuda" with "cpu"
    device = torch.device("cuda")
    model.to(device)
    target_model.to(device)
    

    The device variable will be global, since we will also need it to move the data to the device.

    We also need to define an optimizer that will update the model weights using gradient descent. There are many to choose from; here we use Adam.

    import torch.optim as optim

    optimizer = optim.Adam(model.parameters(), lr=0.00003)
    

    Together
    import copy
    import torch
    import torch.nn as nn
    import torch.optim as optim

    device = torch.device("cuda")  # replace "cuda" with "cpu" to train on the CPU

    def create_new_model():
        model = nn.Sequential(
            nn.Linear(2, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 3)
        )
        target_model = copy.deepcopy(model)
        # Initialize the weights of the neural network
        def init_weights(layer):
            if type(layer) == nn.Linear:
                nn.init.xavier_normal_(layer.weight)
        model.apply(init_weights)
        # Move the model to the device chosen at the very beginning (GPU or CPU)
        model.to(device)
        target_model.to(device)
        # Define the optimizer that will be used to update the model weights
        optimizer = optim.Adam(model.parameters(), lr=0.00003)
        return model, target_model, optimizer
    


    Now let's declare a function that computes the loss, backpropagates its gradient, and takes a gradient descent step. But before that, we need to move the batch data to the GPU:

    state, action, reward, next_state, done = batch
    # Move the batch to the device chosen earlier
    state = torch.tensor(state).to(device).float()
    next_state = torch.tensor(next_state).to(device).float()
    reward = torch.tensor(reward).to(device).float()
    action = torch.tensor(action).to(device)
    done = torch.tensor(done).to(device)
    

    Next, we need the target values of the Q-function. Since we do not know the true values, we estimate them through the Q-values of the next state (terminal states have no next state, so their future term stays zero):

    target_q = torch.zeros(reward.size()[0]).float().to(device)
    with torch.no_grad():
        # For non-terminal transitions, take the maximum Q-value of the next state;
        # terminal transitions keep a zero future term
        target_q[~done] = target_model(next_state).max(1)[0].detach()[~done]
    target_q = reward + target_q * gamma
    

    And the current prediction:

    q = model(state).gather(1, action.unsqueeze(1))
    

    Using target_q and q, we calculate the loss function and update the model:

    loss = F.smooth_l1_loss(q, target_q.unsqueeze(1))
    # Clear the current gradients inside the network
    optimizer.zero_grad()
    # Backpropagate the error
    loss.backward()
    # Clip the gradient values so that the updates are not too large
    for param in model.parameters():
        param.grad.data.clamp_(-1, 1)
    # Take an optimization step
    optimizer.step()
    

    Together
    import torch.nn.functional as F

    gamma = 0.99

    def fit(batch, model, target_model, optimizer):
        state, action, reward, next_state, done = batch
        # Move the batch to the device chosen earlier
        state = torch.tensor(state).to(device).float()
        next_state = torch.tensor(next_state).to(device).float()
        reward = torch.tensor(reward).to(device).float()
        action = torch.tensor(action).to(device)
        done = torch.tensor(done).to(device)
        # Compute the values our network should output
        target_q = torch.zeros(reward.size()[0]).float().to(device)
        with torch.no_grad():
            # For non-terminal transitions, take the maximum Q-value of the next state;
            # terminal transitions keep a zero future term
            target_q[~done] = target_model(next_state).max(1)[0].detach()[~done]
        target_q = reward + target_q * gamma
        # Current prediction
        q = model(state).gather(1, action.unsqueeze(1))
        loss = F.smooth_l1_loss(q, target_q.unsqueeze(1))
        # Clear the current gradients inside the network
        optimizer.zero_grad()
        # Backpropagate the error
        loss.backward()
        # Clip the gradient values so that the updates are not too large
        for param in model.parameters():
            param.grad.data.clamp_(-1, 1)
        # Take an optimization step
        optimizer.step()
    


    Since the model only estimates the Q-function and does not itself perform actions, we need to define a function that decides which action the agent takes. As the decision-making rule we take the $\varepsilon$-greedy policy. The idea is that the agent usually acts greedily, choosing the action with the maximum Q-value, but with probability $\varepsilon$ it takes a random action. Random actions are needed so that the algorithm can explore actions it would never take if it were guided only by the greedy policy; this process is called exploration.

    import random

    def select_action(state, epsilon, model):
        if random.random() < epsilon:
            return random.randint(0, 2)
        return model(torch.tensor(state).to(device).float().unsqueeze(0))[0].max(0)[1].view(1, 1).item()
    

    Since we train the neural network on batches, we need a replay buffer in which we will store the experience of interacting with the environment and from which we will sample batches:

    class Memory:
        def __init__(self, capacity):
            self.capacity = capacity
            self.memory = []
            self.position = 0

        def push(self, element):
            """Stores an element in the cyclic buffer"""
            if len(self.memory) < self.capacity:
                self.memory.append(None)
            self.memory[self.position] = element
            self.position = (self.position + 1) % self.capacity

        def sample(self, batch_size):
            """Returns a random sample of the given size"""
            return list(zip(*random.sample(self.memory, batch_size)))

        def __len__(self):
            return len(self.memory)
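    As a quick illustration (this snippet is not from the original article), the buffer stores transition tuples, and sample returns them transposed into per-field columns, which is exactly the batch layout that fit expects:

    memory = Memory(5000)
    # A transition is (state, action, reward, new_state, done)
    memory.push(((-0.5, 0.0), 2, -1.0, (-0.49, 0.01), False))
    memory.push(((-0.49, 0.01), 2, -1.0, (-0.47, 0.02), False))
    state, action, reward, next_state, done = memory.sample(2)
    print(action)  # e.g. (2, 2): one tuple per field of the transition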
    

    Naive solution


    First, declare the constants that we will use in the learning process, and create a model:

    # Number of model updates between target_model updates
    target_update = 1000
    # Size of one batch that the model takes as input
    batch_size = 128
    # Number of environment steps
    max_steps = 100001
    # Bounds of the exploration coefficient
    max_epsilon = 0.5
    min_epsilon = 0.1
    # Create the model and the buffer
    memory = Memory(5000)
    model, target_model, optimizer = create_new_model()
    

    Although it would be logical to divide the interaction process into episodes, for describing the learning process it is more convenient to divide it into separate steps, since we want to take one gradient descent step after each step of the environment.

    Let's look more closely at what one learning step looks like here. Assume we are at step number step out of max_steps, with the current state state. Then taking an action with the $\varepsilon$-greedy policy looks like this:

    epsilon = max_epsilon - (max_epsilon - min_epsilon)* step / max_steps
    action = select_action(state, epsilon, model)
    new_state, reward, done, _ = env.step(action)
    

    We immediately add the experience we gained to memory and start a new episode if the current one has ended:

    memory.push((state, action, reward, new_state, done))
    if done:
          state = env.reset()
          done = False
    else:
          state = new_state
    

    And we take a gradient descent step (provided, of course, that we have already collected enough experience for at least one batch):

    if step > batch_size:
         fit(memory.sample(batch_size), model, target_model, optimizer)
    

    Now it remains to update target_model:

    if step % target_update == 0:
          target_model = copy.deepcopy(model)
    

    However, we would also like to monitor the learning process. To do this, we will play one additional episode with epsilon = 0 after each update of target_model, storing the total reward in the rewards_by_target_updates buffer:

    if step % target_update == 0:
          target_model = copy.deepcopy(model)
          state = env.reset()
          total_reward = 0
          while not done:
                action = select_action(state, 0, target_model)
                state, reward, done, _ = env.step(action)
                total_reward += reward
          done = False
          state = env.reset()
          rewards_by_target_updates.append(total_reward)
    

    Together
    # Number of model updates between target_model updates
    target_update = 1000
    # Size of one batch that the model takes as input
    batch_size = 128
    # Number of environment steps
    max_steps = 100001
    # Bounds of the exploration coefficient
    max_epsilon = 0.5
    min_epsilon = 0.1

    def train():
        # Create the model and the buffer
        memory = Memory(5000)
        model, target_model, optimizer = create_new_model()
        rewards_by_target_updates = []
        state = env.reset()
        done = False
        for step in range(max_steps):
            # Take a step in the environment
            epsilon = max_epsilon - (max_epsilon - min_epsilon) * step / max_steps
            action = select_action(state, epsilon, model)
            new_state, reward, done, _ = env.step(action)
            # Store the experience and, if necessary, restart the environment
            memory.push((state, action, reward, new_state, done))
            if done:
                state = env.reset()
                done = False
            else:
                state = new_state
            # Gradient descent
            if step > batch_size:
                fit(memory.sample(batch_size), model, target_model, optimizer)
            if step % target_update == 0:
                target_model = copy.deepcopy(model)
                # Exploitation: play one greedy episode to track progress
                state = env.reset()
                total_reward = 0
                while not done:
                    action = select_action(state, 0, target_model)
                    state, reward, done, _ = env.step(action)
                    total_reward += reward
                done = False
                state = env.reset()
                rewards_by_target_updates.append(total_reward)
        return rewards_by_target_updates

    rewards_by_target_updates = train()
    


    Run this code and get something like this graph:

    Baseline graph in the form of a straight line y = -200
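    To reproduce such a learning-curve plot from rewards_by_target_updates, here is a minimal sketch (it assumes matplotlib is installed; plotting code is not part of the original article):

    import matplotlib.pyplot as plt

    plt.plot(rewards_by_target_updates)
    plt.xlabel("target_model update")
    plt.ylabel("total reward of the evaluation episode")
    plt.show()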

    Something went wrong?


    Is this a bug? Is the algorithm wrong? Are the parameters bad? Not really. In fact, the problem is in the task itself, namely in the reward function. Let's look at it more closely. At each step our agent receives a reward of -1, and this happens until the episode ends. Such a reward motivates the agent to finish the episode as quickly as possible, but it does not tell the agent how to do so. Because of this, the only way for the agent to learn to solve the problem in this formulation is to stumble upon a solution many times through exploration.

    Of course, instead of our $\varepsilon$-greedy policy one could try to use more sophisticated exploration algorithms. However, first, applying them would make our model more complex, which we would like to avoid, and second, it is not guaranteed that they would work well enough for this task. Instead, we can eliminate the source of the problem by modifying the task itself, namely by changing the reward function, i.e. by applying what is called reward shaping.

    Speeding up convergence


    Our intuition tells us that to drive up the hill, the car needs to build up speed. The higher the speed, the closer the agent is to solving the problem. We can tell it about this, for example, by adding the absolute value of the velocity, multiplied by some coefficient, to the reward:
    modified_reward = reward + 10 * abs(new_state[1])


    Accordingly, the line in the training loop (the train function)
    memory.push((state, action, reward, new_state, done))
    should be replaced by
    memory.push((state, action, modified_reward, new_state, done))
    Now let's look at the new graph (it shows the original reward, without modifications):

    Baseline versus RS graph
    Here RS is an abbreviation for Reward Shaping.

    Is it good to do this?


    The progress is obvious: our agent has clearly learned to drive up the hill, since the reward started to differ from -200. Only one question remains: if, by changing the reward function, we also changed the task itself, will the solution we found for the new problem also be good for the old one?

    To begin with, let us clarify what "good" means in our case. In solving the problem, we are trying to find the optimal policy, the one that maximizes the total reward over the episode. So we can replace the word "good" with the word "optimal", because that is what we are looking for. We also optimistically hope that sooner or later our DQN will find the optimal solution of the modified problem rather than get stuck at a local maximum. The question can therefore be reformulated as follows: if, by changing the reward function, we also changed the problem itself, will the optimal solution of the new problem be optimal for the old one?

    As it turns out, we cannot give such a guarantee in the general case. The answer depends on exactly how we changed the reward function, on how it was defined before, and on how the environment itself is structured. Fortunately, there is a paper (Ng, Harada, and Russell, 1999) whose authors investigated how changing the reward function affects the optimality of the solution found.

    First, they found a whole class of "safe" changes based on the potential method: $R' = R + (\gamma \cdot \Phi(new\_state) - \Phi(state))$, where $\Phi$ is a potential that depends only on the state. For such functions, the authors proved that if a solution of the new problem is optimal, then it is also optimal for the old problem.

    Second, the authors showed that for any other modification $R' = R + F(s, a)$ there exist an environment, a reward function R, and an optimal solution of the modified problem such that this solution is not optimal for the original problem. This means that we cannot guarantee that the solution we find is good if we use a change that is not based on the potential method.

    Thus, using potential functions to modify the reward can only change the convergence rate of the algorithm; it does not affect the final solution.

    Speeding up convergence correctly


    Now that we know how to change the reward safely, let's try to modify the task again, this time using the potential method instead of the naive heuristic:
    modified_reward = reward + 300 * (gamma * abs(new_state[1]) - abs(state[1]))
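    This is exactly the potential-based form described above with the potential $\Phi(s) = 300 \cdot |v|$, where v is the velocity. As a sketch (the helper below is illustrative and not part of the original code), the same shaping can be written generically so that any potential can be plugged in:

    # Potential-based reward shaping: R' = R + gamma * phi(new_state) - phi(state)
    def shape_reward(reward, state, new_state, phi, gamma=0.99):
        return reward + gamma * phi(new_state) - phi(state)

    # The potential used in this post: the absolute value of the velocity, scaled by 300
    def velocity_potential(state):
        return 300 * abs(state[1])

    # modified_reward = shape_reward(reward, state, new_state, velocity_potential, gamma)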

    Let's look at the graph of the original reward:

    Graph comparing baseline, RS and RS with potentials

    As it turned out, besides having theoretical guarantees, modifying the reward with potential functions also significantly improved the result, especially in the early stages. Of course, it might be possible to find better hyperparameters (random seed, gamma, and the other coefficients) for training the agent, but reward shaping nevertheless significantly increases the convergence rate of the model.

    Afterword


    Thank you for reading to the end! I hope you enjoyed this short practice-oriented excursion into reinforcement learning. Mountain Car is clearly a "toy" problem, but as we have seen, teaching an agent to solve even a task that seems simple from a human point of view can be difficult.
