two_algorithmic_learning_machines

Presenter | Ilya Sutskever |

Context | NIPS 2015 Reasoning, Attention, and Memory Workshop |

Date | 12/12/15 |

Deep neural networks work mainly because they are expressive (they can express very complicated functions) and we can train them, both are necessary. Some neural networks are trainable with backpropagation (though we don't understand why), which depends on both the model and the data. Neural networks can represent the “right” functions, e.g. the functions which solve the problems we care about. Deep functions also can successfully model functions we care about which shallow ones cannot. For example, neural networks can simulate logical circuits, which means that deep enough neural networks can do anything a computer can do in a reasonable amount of time, although it would be surprising to train successfully. Depth and width are therefore both key, but neither are sufficient. Naively built wide and deep models are expensive - they have too many parameters, which means you need too much data to train them. The ultimate solution would be a very deep and very wide model with very few parameters (making it trainable, the difficult part).

A lot of excitement in this area originated from the Neural Turing Machine model, which satisfies these properties by implementing a differentiable computer which utilizes a neural network. The neural network can read and write from the memory in a differentiable way. Since it's a recurrent neural network, at each time step the neural network reads and writes the memory. The whole system is trained with backpropagation.

This model tests whether C-style pointers can be added to Neural Turing Machines. Pointers point to locations in memory, which are natural objects that occur often in algorithms. The goal is a neural network which can learn to manipulate, dereference, and repeatedly follow pointers. A regular neural turing machine can solve this problem, but it is not very easy to learn: In short, in the NTM there are two kinds of addressing mechanisms, content-based and position-based. The position-based addressing can't support pointers; the content-based addressing can but it would require learning the behavior. So, the key idea is a differentiable attention mechanism which works well with pointers. This is implemented by making memory a square with $M$ words, each of which is width $M$. This means you can represent distributions over addresses, which is required when learning with backpropagation. If you were to use a binary pointer, differentiating would not be possible. Accessing the memory by a particular location just involves multiplying it by a pointer vector. Memory updating is also possible by taking the pointer and taking an outer product of it with the update value and adding it to the memory. This results in a system where you can work with pointers in a natural way.

To make it work, the controller must be able to deal with a large memory. The model therefore maintains a large number of registers which are also words of the same size $M$, and the controller can use them when applying operations which are implemented with soft attention. The way the registers are sensed is similar, which avoids having a weight for every dimension of the register. Some of the operations interact with the memory - it can take an address and return a value which says what is in the memory's address, or it can take an address and a value and update the memory. This model can take a pointer, increment it by one, and dereference the incremented pointer, which is exactly what it was designed for. It is therefore capable of problems where pointer manipulation is possible, for example finding the $k$th element of a linked list or determining if a certain element is in the list. In all cases, the model is provided with the entire data structure (including pointers) as input. The model was able to solve all of the problems. The resulting models are extremely deep, requiring very careful hyperparameter optimization.

A model which is very parallel requires fewer steps, which means it is very shallow and can therefore be run very efficiently and can be trained feasibly. The model is therefore a 1-dimensional convolutional which is applied many times to an input, applied over and over until the desired output is produced. We desire models which can deal with differing-length inputs, which the convolution does naturally. The convolution operations are also very efficient. The model is actually a “GRU through depth”, that is, it includes gating across layers - it could have been an LSTM or highway network. For some problems, different weights are used for different layers, and the weights of each layer were cycled through, which results in more parameters and makes the problem easier. To result in a single layer, over the course of training, a cost function is added to make all layers close together. Dropout, gradient noise, gradient clipping and hyperparameter optimization were all necessary to get the model to work. As a test protocol, the model was trained on short instances of a problem, and tested on long instances. It was found that carefully optimizing the hyperparameters resulted in models which could be trained on sequence lengths up to length 20 which were effective on sequences up to length 2000. The resulting model is similar to the grid LSTM, but is more parallel - you can push a lot more data through it. It may therefore be useful for machine translation. It can also be considered a differentiable cellular automaton.

two_algorithmic_learning_machines.txt · Last modified: 2015/12/17 21:59 (external edit)