Human-level control through deep reinforcement learning

Notes

Reinforcement learning methods that use non-linear function approximators (e.g. neural networks) for the action-value function are known to be unstable or even to diverge, with no theoretical convergence guarantees.

This paper gets around that problem with two main changes to training:

  1. Experience replay

Each time step is recorded as a transition: state, action, reward, and next state. During training, a random mini-batch of stored transitions is drawn and used for a Q-learning update, which breaks the correlation between consecutive samples (see the replay-buffer sketch below).

  2. Separate target and behavior value networks

The Q-learning targets are computed with a separate target network, a copy of the behavior network whose weights are held fixed. Only after a fixed number of updates are the behavior network's latest weights copied into the target network, which keeps the targets stable between copies (see the training-step sketch below).
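
A minimal sketch of the experience-replay idea; names like `ReplayBuffer` and the default `capacity` are my own choices, not from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks correlations between consecutive time steps
        return random.sample(self.buffer, batch_size)
```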
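And a rough sketch of how the two networks interact in one training step, assuming PyTorch and batches of pre-stacked tensors; the tiny `Sequential` model, the hyperparameter values, and the Huber loss choice are illustrative stand-ins, only the target formula `r + gamma * max_a' Q_target(s', a')` and the periodic copy follow the paper:

```python
import copy
import torch
import torch.nn.functional as F

gamma = 0.99        # discount factor
sync_every = 1_000  # how often to refresh the target network (C in the paper)

behavior_net = torch.nn.Sequential(  # stand-in for the convolutional Q-network
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
target_net = copy.deepcopy(behavior_net)  # frozen copy, used only to compute targets
optimizer = torch.optim.RMSprop(behavior_net.parameters(), lr=2.5e-4)

def train_step(batch, step):
    states, actions, rewards, next_states, dones = batch
    # Q-learning target: r + gamma * max_a' Q_target(s', a'); zero future value at terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    # Q-values of the actions actually taken, from the behavior network
    q_values = behavior_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Periodically copy the behavior network's weights into the target network
    if step % sync_every == 0:
        target_net.load_state_dict(behavior_net.state_dict())
```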