# Curiosity-Driven Learning – Exploration By Random Network Distillation

In recent years, Reinforcement Learning has proven itself to be a powerful technique for solving closed tasks with constant rewards, most commonly games. A major challenge in the field remains training a model when external feedback (reward) to actions is sparse or nonexistent. Recent models have tried to overcome this challenge by creating an intrinsic reward mechanism, mainly known as curiosity, that rewards the model for discovering new territories and states.

A new paper from OpenAI, Exploration by Random Network Distillation (RND), presents a novel approach for the intrinsic reward. The model tries to predict if a given state has been seen before and gives bigger rewards for unfamiliar states.

The model shows state-of-the-art results in several Atari games, including Montezuma’s Revenge, which is known to be hard for RL algorithms. It is also relatively simple and has proven to be effective in environments which include distracting background noise.

### Background

Reinforcement learning (RL) is a group of algorithms that are reward-oriented, meaning they learn how to act in different states by maximizing the rewards they receive from the environment. A challenging testbed for them is the set of Atari games that were developed more than 30 years ago, as they provide a rich visual input (210×160×3 per frame) and a diverse set of tasks that were designed to be difficult for humans.

The games differ in their complexity and the frequency of the external rewards. While in Breakout a reward is given every time you hit a brick, in Montezuma’s Revenge and others there are only a few rewards during a single level. Montezuma’s Revenge, for example, is known to be challenging, as it requires long (hundreds of steps) and complicated combinations of actions to pass deadly obstacles and find rewards. The animations below illustrate the difference between the games.

Left: Breakout – The agent receives external rewards frequently, one for each brick.
Right: Montezuma’s Revenge – The only external reward is received when picking up the key.

To succeed in games without frequent extrinsic rewards, the agent has to explore the environment in hope of discovering sparse rewards. These scenarios are common in real life, from finding lost keys in the house to discovering new cancer drugs. In such cases, the agent is required to use intrinsic rewards while acting mostly independently of extrinsic rewards. There exist two common approaches to RL with intrinsic rewards:

1. Count-based approaches that keep count of previously visited states, and give bigger rewards to novel states. The disadvantage of this approach is that it tends to become less effective as the number of possible states grows.
2. A different approach is ‘next-state prediction’, in which the model tries to predict the next state, takes action to move to the next state, and then minimizes the error compared to the predicted state. By exploring, more states become known and the error declines.

These approaches perform better than models that are based only on extrinsic rewards (such as the well-known models DQN and A3C), but still worse than an average human.
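To make the count-based approach (item 1 above) concrete, here is a minimal sketch. The `CountBonus` class and the 1/sqrt(N) bonus are illustrative assumptions, a common choice rather than something this post or the paper prescribes, and states are assumed to be hashable.

```python
import math
from collections import Counter


class CountBonus:
    """Hypothetical count-based exploration bonus: reward = 1 / sqrt(visits)."""

    def __init__(self):
        self.visits = Counter()

    def reward(self, state):
        # Record the visit, then pay a bonus that shrinks with familiarity.
        self.visits[state] += 1
        return 1.0 / math.sqrt(self.visits[state])


bonus = CountBonus()
first = bonus.reward("room_1")   # novel state: largest bonus
for _ in range(2):
    bonus.reward("room_1")
fourth = bonus.reward("room_1")  # fourth visit: bonus has shrunk
```

When states are raw frames, almost every state is visited exactly once, so the bonus stays at its maximum everywhere. This is the scaling weakness mentioned in item 1.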

In general, when using intrinsic rewards, the assessment of future states suffers from three possible sources of error:

1. Unfamiliar states error – The model fails to generalize from previously-visited states to new states, resulting in high errors in the prediction of future states. By repeatedly discovering new states and learning from them, the model gradually reduces this kind of error.
2. Stochastic noise – This is also known as the Noisy-TV problem, in which a portion of the environment produces random noise (like a room with a TV which presents white noise). This causes many states to be new to the agent, and the next state is often unpredictable and unrelated to the agent’s action.
3. Model constraints – The model’s architecture is limited and can’t generalize the environment accurately enough to predict the next state. For example, the number and size of layers in a neural network needed to predict the next state are unknown.

### How does RND work?

RL systems with intrinsic rewards use the unfamiliar states error (Error #1) for exploration and aim to eliminate the effects of stochastic noise (Error #2) and model constraints (Error #3). To do so, the model requires 3 neural networks: A fixed target network that generates a constant output for a given state, a prediction network that tries to predict the target network’s output, and a policy network that decides on the agent’s next action.

#### Target and prediction networks

The target and prediction networks are used to generate a bigger intrinsic reward for unfamiliar states, by calculating the difference between the outputs of the two networks. The networks are of the same size and architecture – convolutional encoders (CNN) followed by fully-connected layers to embed the state into a features vector f. However, there is an important difference between them:

1. The target network is a neural network with fixed, randomized weights, which is never trained. Therefore, its output is constant for a given state (input) but variable between different states: ${f}_{i}\left(x\right)={f}_{j}\left(x\right)$ for any two time steps i and j, and in general ${f}_{i}\left(x\right)\ne {f}_{i}\left(y\right)$ for any two different inputs x and y.
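2. The prediction network is trained to predict the target network’s output. Each state is fed into both networks and the prediction network is trained to minimize the squared difference (MSE) between their outputs, which also serves as the intrinsic reward: ${r}_{i}={\left\|{f}_{i+1}-{f}_{i+1}^{\prime}\right\|}^{2}$, where ${f}_{i+1}$ and ${f}_{i+1}^{\prime}$ are the target and prediction outputs for the state reached at step i+1.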

As more states are fed into the system, the prediction network becomes better at predicting the target network output when it receives known states. When reaching previously-visited states, the agent receives a small reward (because the target output is predictable) and the agent is disincentivized to reach them again. In other words, unlike common models, the agent isn’t trying to predict the next state based on the current state and action, but instead tries to predict the novelty of a future state.
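The interaction between the two networks can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `TinyNet` is a hypothetical one-layer stand-in for the CNN encoder, states are short lists of floats instead of frames, and the learning rate and step count are arbitrary.

```python
import math
import random


class TinyNet:
    """Toy stand-in for the CNN + dense encoder: f(x) = tanh(W x)."""

    def __init__(self, n_in, n_out, rng):
        self.w = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

    def forward(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w]


def intrinsic_reward(target, predictor, state):
    """Squared error between the predictor's and the target's features."""
    t, p = target.forward(state), predictor.forward(state)
    return sum((pi - ti) ** 2 for pi, ti in zip(p, t))


def train_predictor(target, predictor, state, lr=0.1):
    """One gradient-descent step on the squared error; the target is untouched."""
    t, p = target.forward(state), predictor.forward(state)
    for row, pi, ti in zip(predictor.w, p, t):
        grad = 2.0 * (pi - ti) * (1.0 - pi * pi)  # d(error)/d(pre-activation)
        for k, xk in enumerate(state):
            row[k] -= lr * grad * xk


rng = random.Random(0)
target = TinyNet(4, 3, rng)      # fixed random weights, never trained
predictor = TinyNet(4, 3, rng)   # different random init, trained below
state = [1.0, 0.0, 0.5, -0.5]

reward_before = intrinsic_reward(target, predictor, state)
target_out_before = target.forward(state)
for _ in range(200):
    train_predictor(target, predictor, state)
reward_after = intrinsic_reward(target, predictor, state)
```

After repeated visits to the same state, `reward_after` is smaller than `reward_before`, while the target's output for that state has not changed. A state the predictor has never been trained on would typically still produce a large error.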

The target-prediction architecture has several benefits:

1. After enough training on states with random noise (from a fixed stationary distribution), the prediction network is able to better predict the outputs of the target network. As the prediction error decreases, the agent becomes less attracted to the noisy states than to other unexplored states. This reduces the Noisy-TV error (#2).
2. In next-step prediction models, it is unknown in advance which architecture (number of layers, layer size, etc) is required to model the result of an action. However, the predictor network only needs to predict the outcome of the target network. By having the same architecture as the target network, it should be able to learn the output of a familiar state properly. This “solves” model constraints error (#3).
3. It gives the agent a bias towards staying alive in the game, since dying forces it back to a familiar state. This benefit is shared by other ‘curiosity-based’ RL methods.

One of the challenges in this model is that the intrinsic reward decreases as more states become familiar and might vary between different environments, making it difficult to choose hyperparameters. To overcome this, the intrinsic rewards are normalized in each update cycle.
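One plausible way to implement such normalization is to divide each batch of intrinsic rewards by a running estimate of their standard deviation. The sketch below is an assumption for illustration: the `RunningStd` helper and the choice to track the rewards themselves are not taken from the post, and the exact statistic used in the paper may differ.

```python
import math


class RunningStd:
    """Hypothetical running standard deviation (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero

    def update(self, values):
        for v in values:
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there is enough data for an estimate.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)


def normalize_rewards(rewards, stats):
    """Update the running statistics, then rescale the batch."""
    stats.update(rewards)
    scale = stats.std + stats.eps
    return [r / scale for r in rewards]


stats = RunningStd()
normalized = normalize_rewards([1.0, 2.0, 3.0, 4.0], stats)
```

Because the scale tracks the rewards over time, the normalized values stay in a similar range even as the raw prediction error shrinks during training.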

#### Policy network

The policy network’s role is to decide on the next action based on the current state and its internal model, which was trained on previous states. To make that decision, it uses an input embedder and a policy optimizer:

##### Input Embedder

The input embedder encodes the environment state to features. The paper compares 2 architectures – CNN or a mixture of CNN and recurrent layers (GRU cells). The recurrent layers are supposed to help with predicting the next action by capturing longer contexts of the game, e.g. events that happened before the current state, and were indeed found to perform better than CNN-only layers in most cases.

##### PPO

A major problem in training a policy model is convergence, since policies tend to change drastically as a result of a single update in rewards. For example, in some architectures, a single bad episode (game) can completely change your strategy. Therefore, the policy head that sits on top of the embedding layers and predicts the next action from the embedded state is trained with Proximal Policy Optimization (PPO). The main contribution of PPO is optimizing the policy safely without radical updates, by bounding the difference between consecutive policy updates.
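The bounding is usually done with PPO's clipped surrogate objective, which limits how far the probability ratio between the new and old policy can move the objective. Below is a minimal per-sample sketch. The function name is hypothetical and `clip_eps=0.2` is only an illustrative default, not necessarily the value used in the RND experiments.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample.

    ratio     = pi_new(a | s) / pi_old(a | s)
    advantage = estimated advantage of action a in state s
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.5, 1.0)     # ratio clipped to 1.2
penalty_kept = clipped_surrogate(0.5, -1.0)   # ratio clipped to 0.8
inside_range = clipped_surrogate(1.1, 2.0)    # no clipping applies
```

In training, this value is averaged over a batch and maximized, so a single unusually good or bad sample cannot pull the policy far from its previous version.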

To update the policy, PPO first needs to estimate the future intrinsic and extrinsic rewards for a given state (“Value Head”). Handling each type of reward separately allows more flexibility in determining the effect each type has on the policy and in the way each type is calculated:

• The intrinsic rewards are counted over a fixed batch of time steps, e.g. 128 time steps, regardless of whether the agent has ‘died’ inside the game. The research found that this unique approach (non-episodic) enables better exploration, since it encourages the agent to take risky actions that might reveal new states. If the intrinsic rewards were episodic, these actions might have ended the game, thus ending the rewards.
• Extrinsic rewards are counted over an entire episode until the agent dies. Using non-episodic extrinsic rewards might cause the agent to “hack” the game, for example by finding easy and quick rewards and then killing itself.
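The difference between the two reward streams shows up in how discounted returns are accumulated over a batch of time steps. The sketch below is illustrative: the function name, the discount factor and the tiny three-step batch are assumptions, and `dones[t]` is taken to mean that the episode ended right after step `t`.

```python
def discounted_returns(rewards, dones, gamma, episodic):
    """Discounted return per step, computed backwards over one batch.

    episodic=True  : the sum is cut at episode boundaries (extrinsic stream).
    episodic=False : boundaries are ignored (intrinsic stream).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if episodic and dones[t]:
            running = 0.0  # nothing after the agent's death counts
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0]
dones = [False, True, False]  # the agent dies after the second step

extrinsic = discounted_returns(rewards, dones, gamma=0.5, episodic=True)
intrinsic = discounted_returns(rewards, dones, gamma=0.5, episodic=False)
# extrinsic == [1.5, 1.0, 1.0]
# intrinsic == [1.75, 1.5, 1.0]
```

In the non-episodic stream, the steps before the death still receive credit for novelty found after the restart, which is what makes risky actions less costly from the point of view of exploration.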

The charts below show the policy network and the entire architecture:

Notes:

• The PPO in this paper is implemented using an Actor-Critic model, in which the value heads (critic) estimate the advantage used to update the policy (actor). However, PPO can be used with any advantage estimator.
• An additional benefit of PPO is that it increases the training efficiency by allowing multiple epochs of training, with mini-batches of inputs in each epoch. Bounding the policy updates ensures that even with multiple epochs the total change won’t be too radical.

### Results

The paper compares the RND model to state-of-the-art (SOTA) algorithms and, as an ablation test, to two similar baseline models:

1. A standard PPO without an intrinsic exploration reward.
2. A PPO model with intrinsic reward based on forward dynamics error. This model predicts the next state based on the current state and action, and minimizes the prediction error, which also serves as the intrinsic reward.

The RND agent achieved state-of-the-art results in 3 out of 6 games, and was able to reach a better score than the “average human” in Montezuma’s Revenge. However, its performance in 2 other games was significantly lower compared to other SOTA algorithms. The paper does not explain the nature of games in which this technique is less useful.

### Conclusions

The RND model exemplifies the progress that was achieved in recent years in hard exploration games. The innovative part of the model, the fixed target network and the prediction network, is promising thanks to its simplicity (implementation and computation) and its ability to work with various policy algorithms.

On the other hand, there is still a long way to go – there is no model to rule them all, and the performance varies among different games. In addition, while an RNN might help a bit with keeping a longer context, global exploration is still a challenge. Scenarios that require long-range dependencies, e.g. using a key that was found in the first room to open the last door, are still out of reach.