**Context:** NIPS 2015 Oral
Good models exist for some structures, such as RNNs for temporal data and convolutional networks for spatial structure, but they still struggle with certain dependencies, such as when the data must be accessed out of order, or with very long-term dependencies. For example, in question answering tasks where a question is asked about a simple sequential story describing actions carried out by different people, the relevant sentences may arrive out of order, and many sentences may separate the relevant ones.

This work proposes a neural network model with an external memory from which it can read with soft attention, can perform multiple lookups (hops), and is trainable end-to-end with backpropagation. The original memory network used hard attention and required explicit supervision of the attention during training, which is available only for simple tasks. The end-to-end memory network replaces this with soft attention, so supervision is needed only at the final output.

The memory consists of a variable-size set of vectors, and a controller module determines where and when to read from and write to the memory. The addressing signal (the controller's state vector) is dotted against each entry in the memory, and the scores are passed through a softmax to obtain a probability distribution over memory locations. The vector read from memory is then the weighted average of the entries under this distribution. To create the memory vectors, each input is embedded into some space and stored; a simple way to encode order is to include the time step in the embedding input. Using a dot product of the controller state against the memory makes it easy to retrieve similar entries. The resulting system is closely related to attention-based models.

In experiments on bAbI, a synthetically generated Q&A toy dataset, the end-to-end memory network achieved good results, though not as good as the original strongly supervised memory network.
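The read mechanism described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the function names are invented, and the additive update of the controller state between hops is one simple choice for chaining multiple lookups.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(memory, query):
    """One soft-attention lookup.

    memory: (num_slots, dim) array of memory vectors.
    query:  (dim,) controller state vector.
    Dots the query against every memory entry, softmaxes the scores
    into a distribution over locations, and returns the weighted
    average of the memory vectors.
    """
    scores = memory @ query      # (num_slots,) similarity scores
    p = softmax(scores)          # soft attention over memory locations
    return p @ memory            # weighted average read-out, (dim,)

def multi_hop_read(memory, query, hops=3):
    """Multiple lookups: fold each read result back into the state."""
    state = query
    for _ in range(hops):
        state = state + memory_read(memory, state)
    return state
```

Because every step (dot product, softmax, weighted average) is differentiable, gradients flow from the output back through the attention weights, which is what makes the model trainable end-to-end without attention supervision.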
On a language modeling task, where the next word must be predicted, the end-to-end memory network outperformed an LSTM for some choices of memory size and number of hops.