Temporal difference learning

From Conservapedia

Temporal difference learning (TD learning) is an algorithmic method for solving reinforcement learning problems. It is predominantly used in machine learning and was initially developed in that context. However, recent discoveries in neuroscience have shown that dopaminergic neurons in the ventral tegmental area have a firing pattern that closely mimics the error signal at the heart of TD learning. This has led many researchers to believe that something similar to the TD algorithm might underlie how people and animals actually learn.

Temporal difference algorithm

TD methods rely on modeling a task as a Markov chain in which each state carries four key pieces of information: the state the system is in, the actions that can be performed in that state, the transition probabilities of moving to another state given the action performed, and the reward or punishment value for being in that state. The algorithm then attempts to learn the optimal action at each state so as to maximize reward and minimize punishment. The heart of the TD algorithm is the idea that one can predict the consequences of entering a new state based on previous experience of that state. So when faced with the decision of which action to choose, the TD algorithm uses a cached value for each of the possible successor states, based on its prior exposure to them. This temporal element of the problem is where the method gets its name. The algorithm updates its prediction for a state by computing a TD error term, the difference between the predicted reward and the actual reward. This TD error can then be used to adjust the cached value of each state.
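The update described above can be sketched in a few lines of Python. The state names, reward value, step size (alpha), and discount factor (gamma) below are illustrative assumptions, not details from the article:

```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Adjust the cached value V[state] using the TD error.

    The TD error is the difference between what actually happened
    (reward plus the discounted cached value of the next state) and
    what was predicted (the cached value of the current state).
    """
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error  # nudge the cached value toward the outcome
    return td_error

# Two hypothetical states with no prior experience (cached values of zero).
V = {"A": 0.0, "B": 0.0}
err = td_update(V, "A", reward=1.0, next_state="B")
# The unexpected reward produces a positive TD error of 1.0,
# and the cached value of state "A" rises from 0.0 to 0.1.
```

Repeating this update over many visits makes the cached values converge toward the true expected rewards, which is the learning behavior the paragraph describes.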

There are two elements to solving such a temporally restricted Markov chain problem. The first is learning the values of the reward and transition functions; the second is choosing the action at a particular state that will net the greatest reward. In theory it would be optimal to learn all the rewards and transition functions before deciding on an action policy, but in practice both must be done at the same time. Many TD algorithms rely on a method called actor-critic, in which the actor chooses an action based on the best information available, and the critic then evaluates that behavior and adjusts the predictions. Because time is usually limited, maximizing total reward requires a careful balance between the actor always choosing the action it currently thinks is best and occasionally choosing options it is unsure of, in case they turn out to be better. This is known as the trade-off between exploitation and exploration, and there are various methods for trying to maximize the returns from both.
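One common way to handle the exploitation-exploration balance mentioned above is an epsilon-greedy rule: most of the time the actor exploits its best cached estimate, but with a small probability epsilon it explores a random action instead. The sketch below assumes a dictionary Q of cached action values keyed by (state, action) pairs; all names are illustrative:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick an action for the given state.

    With probability epsilon, explore: try a uniformly random action.
    Otherwise, exploit: take the action with the highest cached value
    (unseen actions default to a value of 0.0).
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore an uncertain option
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

# With epsilon=0.0 the rule always exploits the best-known action.
Q = {("s", "left"): 1.0, ("s", "right"): 0.0}
best = epsilon_greedy(Q, "s", ["left", "right"], epsilon=0.0)  # "left"
```

Small values of epsilon keep exploration rare once the cached values are reliable; some variants shrink epsilon over time as the estimates improve.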

Temporal difference learning and dopamine

Evidence from animal models has shown that dopaminergic neurons in the ventral tegmental area fire in a way almost identical to how the TD error term is calculated. When a reward is unexpected, the phasic firing rate increases; when a reward is less than expected, it decreases. The neurons also learn to fire at the first stimulus that reliably predicts the reward. This closely parallels the TD error signal.

This has led many to speculate that animals and people learn reinforcement problems using something analogous to TD learning.