|Context||NIPS 2015 Tutorial|
Reinforcement learning is agent-oriented learning, e.g. learning by interacting with your environment to achieve a goal. It's more realistic than other kinds of machine learning, because you have to learn by trial and error as arguably animals and humans do, but it's also harder, because you only get weak evaluative delayed training information (rewards). The learner has to tell itself whether it's right or wrong, and how to take the correct actions. It is maybe the beginnings of the science of mind, which is neither natural science nor applications theory. Reinforcement learning touches on many different fields - computer science, psychology, neuroscience, goals systems/economics, specifically machine learning, classical/operant conditioning, reward systems, operations research, etc. It has applications, e.g. to robotics (learning what sequence of motions results in a desired result).
The RL “API” is that an agent receives a state from the environment, and produces an action/response/control, which results in a single scalar number (reward) and changes the environment state, which is then passed to the agent anew, and so on. The environment may be unknown, nonlinear, stochastic, and/or complex, in contrast to classical control theory - no shortcuts! The agent learns a policy mapping states to actions to maximize the total reward over time. Again, the consequences of actionsmay be delayed, and are received sequentially, and may change over time.
Early on, in 1995, Tesauro developed a reinforcement learning system which could solve backgammon better than human players and all existing computer players. The system learned a representation of the backgammon game state with a neural network, and then chose actions by a shallow search, playing games against itself to try to improve its behavior over time. It's been used for other games, such as Jeopardy, which was successfully played against human players by IBM. Google DeepMind also developed a system which was flexible enough to determine high-performing strategies for a wide variety of Atari games. In other domains, in 2006, researchers at Stanford learned to fly helicopters automatically. Reinforcement learning is widely used to place and select advertisements on the internet. In all these cases, the performance was better by any other method, and was attained without human instruction.
Reinforcement learner systems must balance exploration and exploitation, in order to accrue rewards in the long run. So, taking a bad action which may be bad at a given point in time is fine but it may be better if a large reward can find later. In very simple cases, the environment/state/world can be viewed as having a finite number of states; a finite number of actions can cause transitions between these states, and can result in finite (but stochastic) rewards. The life of a reinforcement learner can be then formulated as a sequence of state, action, reward; state, action, reward… The goal is then to compute the probability of a certain reward, and the next state, given the current state and each action.
The agent is searching for a policy from state to action. Informally, we'd like to choose each action to maximize the discounted sum of future rewards. They can then all be summed up to infinity; because the sum is discounted, the number will be finite. The number of policies is exponential in the number of states, so searching the space to find the best policy is prohibitively expensive. To mitigate this, action-value functions are used, which show how good it is to be in a certain action and be in a certain state. The value of an action-value function is the sum of discounted rewards, given the current state and the current action and forever afterwards using a certain policy. A policy is optimal if it maximizes the action-value function, so all optimal policies share the same optimal value function. Given an optimal policy, acting optimal is simple - just choose the action which achieves the maximum according to the optimal value function, which makes the optimal policy greedy with respect to the optimal value function. There is always at least one such optimal deterministic policy. We need a way to try to find them.
The simplest RL algorithm is Q-learning, which initializes an array Q arbitrarily. Then, actions are chosen in any way, maybe based on the current estimate of Q, so that all actions are taken in all states (until infinity in the limit). At each step, one element of the array Q is updated according to the next reward, plus an estimate of the discounted version of what we expect at the next state, minus the reward we get at the current state. This updates is applied with a step size which is minimized over time. Eventually, Q will converge to the optimal strategy for an arbitrary Markov decision process. This decouples the policy being used, and the optimal policy being sought (off-policy learning).
The “policy improvement theorem” states that given any policy, it can always be “greedified” to find a better policy - that is, given the policy, a better policy can be found by finding an action which has a better policy value. Exploiting this results in “policy iteration” (or the “dance of policy and value”), where first the policy is evaluated, then greedified, then evaluated, and so on until convergence - each policy will always be better then the last, and eventually globally optimal - there are no local optima. And, there's only a finite number of policies though there may be many. Nevertheless, this approach often converges quickly. It's robust - it doesn't matter how you initialize things, do the updates, update its states or only partially, add randomization and noise, etc. In fact, if will still work if only a single state is updated at a time by a random amount which is only correct in expectation.
You can't do the action that you think is best all the time, because then you will miss out forever if you're wrong. In fact, you need to explore all state/actions an infinite number of times. You also can't explore all the time, or else you would never get any advantage of the learning. So, in Q-learning, actions are chosen based on the policy, except that it is required that all state/actions are tried. This can be allowed by using off-policy learning.
Bootstrapping is a method to update an estimate from an estimate, based on the Bellman equations, which essentially show that the optimal policy can be compute recursively. When the optimal action-value function is obtained, all actions can be considered, and the optimal policy is no longer dependent on all future actions.
In practice, we can't find optimal policies in tables, so we must use a function approximator. The first and most important way to use function approximation is to approximate the approximate action-value function. This function will depend on the state, action, and some parameter values which are estimated, and it should be an approximation to the optimal policy. The approximator could be a deep neural network, or a linear function, with the weights being the parameter. It's a powerful concept because it subsumes most of the problem in the hidden state. If we want to scale up our problems, the computation needs to scale linearly in the number of parameters. Often, this works well, but there are counterexamples which show that certain things are impossible - where the parameters diverge to infinity.
One way to find an approximation is to compute a loss function based on the Bellman optimality equation - e.g. the loss is the expectation of the Bellman optimality equation. This produces an update for the parameters based on the current equation and the gradient. The target value depends on the current parameterization, which is bad, but is in practice ignored.
If we formulate the loss function in terms of the Bellman expectation equation, and again ignore the fact that the target depends on the parameters, then we can proceed as before. Now, there is no max in the objective function. The algorithm is on-policy, and so the policy should be close to greedy, which we call epsilon-greedy - a stochastic policy which is usually greedy but with some probability does something at random. Epsilon-greedy is a simple way to obtain exploration. On-policy methods have much better convergence properties when the function approximator is linear, though they may not converge rapidly for control applications. For non-linear function approximation, there is one known counter-example although it's artificial and contrived. For linear functions, it's important to encode useful features about the state. In practice, on-policy methods tend to work better than off-policy, but they find worse policies, which is to say that they may behave better (“safer”) in general but may not eventually find the best policy.
Optimizing linear function approximation can be unstable/diverge, regardless of learning or sampling, exploration, greedification, control, or whether the function approximator is linear or nonlinear. To avoid this problem, you must avoid doing all three of these things at the same time (two at a time is ok): Function approximation (generalizing from a large number of examples), bootstrapping (learning value estimates from other estimates), and off-policy learning (learning about the greedy policy with a more exploratory policy, which is useful because you can learn about different policies for different goals at the same time). All of these things are useful, however. Bootstrapping allows for efficiency, but it introduces bias which may harm the asymptotic behavior of approximation methods. There's a way to trade-off the amount of bootstrapping (called lambda); using no bootstrapping is generally very bad but using very little bootstrapping is often beneficial. Using good features/making function approximation easier can also help.