# http://colinraffel.com/wiki/

### Site Tools

learning_to_execute

# Learning to Execute

 Authors Wojciech Zaremba, Ilya Sutskever Publication info Retrieved here Retrieval date 2/9/15

Uses long-short term memory recurrent neural networks to learn to evaluate short Python programs, using a variant of curriculum learning.

## Problems Attempted

This paper focuses on the problem of predicting the output of a Python program which can be evaluated in $O(n)$ time and constant memory, which is dictated by the fact that the RNN can only read the sequence one and has limited memory. Allowed statements include addition, subtraction, multiplication, variable assignment, if-statements, for-loops (single loop only), and the allowed output is an integer. The programs are parameterized by length (max number of digits in integers in the program) and allowed nesting level (composition of functions). For multiplication, one argument is constrained to always be small; same with for loop range. The network reads the program one character at a time, with no implicit knowledge of symbol semantics. There is relatively little bias in which symbols appear in the output.

In addition to this generic program task, two simpler tasks were attempted. The first is simple addition of two randomly generated numbers of the same (variable) length. The second is sequence memorization, where after seeing a complete input sequence the network must reproduce the sequence. In this task, two input modification techniques were tested: Reversing and repeating the input.

## Creating training examples

Because the complexity of the task can be varied using the length and nesting parameters, curriculum learning can be used where the network is trained on harder and harder tasks. Given a task with length $a$ and nesting $b$, this paper compared four approaches: Baseline, where all training sequences had length $a$ and nesting $b$; naive curriculum learning where the network is trained first with examples of length1 until convergence, then 2, all the way up to $a$, then do the same for nesting up to $b$; mixed, where length/nesting are picked randomly from $[1, a]$ and $[1, b]$ respectively; and combined, where some training cases come from naive and others from mixed.

## Network Structure

A multi-layer network with long short-term memory recurrent layers was used. Long short-term memory networks include multiplicative connections which allow them to store information for a longer amount of time, allowing them to be trained more easily on problems with long-term dependencies. Assuming all layers are dimensionality $n$ with the hidden and cell vectors of layer $l$ at time $t$ written $h^l_t$ and $c^l_t$ respectively, LSTM can be written as a function $h_t^{l - 1}, h_{t - 1}^l, c_{t - 1}^l \rightarrow h_t^l, c_t^l$ as follows: \begin{align} \left( \begin{array}{c} i \\ f \\ o \\ g\end{array} \right) &= \left( \begin{array}{c} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{array} \right) T_{2n, 4n} \left( \begin{array}{c} h^{l - 1}_t \\ h^l_{t - 1} \\ \end{array} \right)\\ c_t^l &= f \odot c_{t - 1}^l + i \odot g\\ h_t^l &= o \odot \tanh(c_t^l) \end{align} where $T_{n, m}$ is a biased linear mapping, i.e. $Wx + b$.

The network used in all experiments had two layers, each with 400 units per layer, all initialized uniformly between $[-.08, .08]$. Training was done with mini batch gradient descent, with the norm of the gradient clipped to 5, a minibatch size of 100, and a learning rate of $.5$ which was decreased by a factor of $.8$ when 95% accuracy was achieved on the training set. Training was stopped when the learning rate was smaller than .001 for the program output prediction tasks and after 10 million samples for the input copying task. Hidden states were initialized to zero, but retained across minibatches. When computing the output of the network, the correct $i$th digit was used for conditioning when predicting the $i+1$th digit (teacher forcing).

## Experiments

In general, the “combined” curriculum learning strategy performed best across tasks. On the program output prediction task, the accuracy varied from 84% for the simplest problems to 36% to the most complex. In the addition task, a 99% accuracy was attained on the addition task for 9 digit numbers. Input reversing and doubling nearly uniformly improved accuracy in the sequence memorization task regardless of the curriculum training strategy used. In all experiments, the training and test tasks were exactly the same; to determine to what extent the network actually learned, for example, to perform addition (with the rules of carrying, for example) could be explored by testing the network on more general tasks after training.

## Why Combined Curriculum Learning Helps

The experimental results suggest that a proper curriculum learning strategy is crucial for attaining good results. This paper hypothesizes that the “naive” approach doesn't work as well for curriculum learning because for the “easier” examples which require less memory, it may be making use of its entire memory, so when the examples become more difficult and require more memory the network isn't able to accommodate. This hypothesis suggests that maybe by exposing the network to harder examples early on (as in the combined case) prevents this from happening.