Notes
Reinforcement learning methods that use a non-linear function approximator (e.g. a neural network) for the action-value function have no convergence guarantees and are known to be unstable or even to diverge.
This paper gets around that problem with two main changes to training:
- Experience replay
Each time step's transition (state, action, reward, next state) is stored in a replay memory. At training time a random mini-batch of transitions is sampled and used for a Q-learning update, which breaks the correlation between consecutive samples (see the first sketch after this list).
- Separate target and behavior value networks
The Q-learning targets are computed with an older, frozen copy of the network (the target network), while gradient updates are applied to the behavior network that chooses actions. Only after a fixed number of updates is the target network refreshed with a copy of the current behavior network (see the second sketch below).
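
A minimal sketch of the replay memory, assuming each transition also stores a `done` flag marking episode ends (standard in most implementations, though not listed above); the class and parameter names are illustrative, not the paper's:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are discarded once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive time steps
        return random.sample(self.buffer, batch_size)
```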

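And a PyTorch sketch of one Q-learning update with a separate target network (the small linear layers, hyperparameters, and function names are stand-ins for illustration, not the paper's convolutional architecture or settings):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # behavior network
target_net = copy.deepcopy(q_net)  # frozen copy used only to compute targets
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
gamma, sync_every = 0.99, 1000

def q_learning_update(batch, step):
    states, actions, rewards, next_states, dones = batch  # tensors sampled from the replay buffer
    # Q(s, a) for the actions actually taken, from the behavior network
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Targets y = r + gamma * max_a' Q_target(s', a') come from the frozen copy
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1 - dones) * next_q
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every sync_every updates, refresh the target network from the behavior network
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```

Keeping the targets fixed between syncs stops the network from chasing a target that moves with every gradient step, which is one source of the instability mentioned above.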