|Context||NIPS 2015 Deep Learning Symposium|
The basic idea of the Neural Turing Machine (NTM) is to turn a neural network into a kind of differentiable computer by giving it read and write access to an external memory. The hope is that, with this combination, things a computer can do can be learned in a supervised manner with backpropagation. The architecture also explicitly separates computation from memory. It consists of a controller network (feedforward or recurrent), which receives input and produces output using information read from the memory, which it can also write to. The memory is a real-valued matrix.

The controller should focus on certain parts of the memory when reading and writing. This attention mechanism is realized by having the controller output a distribution (a "weighting") over the rows of the memory. Content-based lookup has the controller output a "key" vector, which is compared to each row of the memory with some similarity measure; the resulting similarities are normalized (e.g. with a softmax). An additional sharpness parameter can be added to the softmax, which multiplies the similarity scores and therefore "sharpens" the controller's focus onto a small piece of memory, controlling how precise the controller is when looking up items. In addition, the controller can address by location, irrespective of content. This is achieved by having the controller output a "shift kernel" that is convolved cyclically with an existing weighting to produce a shifted weighting. This gives a way to take a weighting that was already generated and "push" it up or down to access different parts of the memory.

Together, these two addressing techniques give the controller different "modes" of accessing memory. If the controller emits only a content key, the memory is accessed like an associative map. If both content and location are specified, the memory is treated like an array (key → head of array, shift → index).
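The two addressing steps can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names are made up, cosine similarity is assumed as the similarity measure, and `beta` plays the role of the sharpness parameter.

```python
import math

def content_weighting(memory, key, beta):
    """Content-based addressing: softmax over sharpened cosine similarities
    between the key vector and each row of the memory matrix."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-8)

    scores = [beta * cos(row, key) for row in memory]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def shift_weighting(w, kernel):
    """Location-based addressing: cyclic convolution of the current weighting
    with a shift kernel over offsets [-k, ..., +k]."""
    n, k = len(w), len(kernel) // 2
    out = [0.0] * n
    for offset, weight in zip(range(-k, k + 1), kernel):
        for i in range(n):
            out[(i + offset) % n] += weight * w[i]
    return out
```

With a large `beta`, the content weighting approaches a one-hot focus on the best-matching row; a kernel like `[0, 0, 1]` (all mass on the +1 shift) then pushes that focus down by one row, wrapping around at the edges.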
Once the weightings are computed, reading is simple: a weighted sum over the memory rows. Writing is modeled after the input and forget gates of an LSTM network: the write head receives an erase vector and an add vector, which determine how much of each memory row to keep and how much to replace.
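A sketch of both operations, assuming `memory` is a list of rows and `w` is a weighting over those rows (the function names are illustrative):

```python
def read(memory, w):
    """Read vector: sum of memory rows weighted by the attention weighting."""
    cols = len(memory[0])
    return [sum(w[i] * memory[i][j] for i in range(len(memory)))
            for j in range(cols)]

def write(memory, w, erase, add):
    """Erase then add, each scaled by the write weighting (LSTM-gate style):
    M[i][j] <- M[i][j] * (1 - w[i] * e[j]) + w[i] * a[j]"""
    return [
        [memory[i][j] * (1 - w[i] * erase[j]) + w[i] * add[j]
         for j in range(len(memory[i]))]
        for i in range(len(memory))
    ]
```

With a one-hot weighting and an all-ones erase vector, a write fully replaces one row and leaves the rest untouched, which is exactly the array-like behavior described above.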
In a copy task (read in a random sequence, then output it), the NTM behaves exactly how you might expect: it focuses on specific locations in memory to write the sequence, then focuses on exactly those locations to read it back out. It also generalizes to longer lengths; in their experiments, training on sequences of length 10 to 100 more or less continued to work at 120 steps. It also worked on a task where the input sequence should be copied N times. In an n-gram task, the NTM learned to use specific memory locations to count the number of occurrences. In recent work, the NTM is able to learn algorithms on graph data such as shortest paths, where it can be observed to iteratively find paths that are farther and farther away from the start/end nodes, storing the path in memory.