Context: NIPS 2015 Reasoning, Attention, and Memory Workshop
Attention plays a crucial role in human cognition. We receive a huge amount of information and must decide what to pay attention to, e.g. the cocktail party problem. This "external" attention is only one form; equally important is internal attention, which allows us to think about one memory at a time. The idea of adding attention to neural networks has been around for a long time, but it has only recently found widespread use, arguably once it was shown to improve performance in practice. Even with no explicit attention mechanism, neural nets learn a form of implicit attention, because they learn to respond more strongly to some parts of the data than others. One way to see this is to inspect the network Jacobian, which for sequential data shows which parts of a sequence the network was attending to when it made its decisions. For example, in handwriting recognition, visualizing the Jacobian shows that the network concentrates only on the word it is currently transcribing; in reinforcement learning, a recent system could be seen attending to different parts of the input to decide which state it was in and what actions to take.
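The Jacobian view of implicit attention can be illustrated with a toy sketch (the network and names here are hypothetical, not from any paper): the magnitude of the output's gradient with respect to each input position shows which positions the network is most sensitive to, i.e. implicitly attending to.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "network": a fixed random hidden layer with a tanh readout.
W = rng.normal(size=(8, 5))
v = rng.normal(size=8)

def net(x):
    return np.tanh(W @ x) @ v  # scalar output

x = rng.normal(size=5)

# Approximate the Jacobian (here just a gradient, since the output is a
# scalar) by central finite differences over each input position.
eps = 1e-6
sensitivity = np.array([
    (net(x + eps * np.eye(5)[i]) - net(x - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

# The position with the largest |sensitivity| is the one the network is
# implicitly "attending" to for this particular input.
attended = int(np.argmax(np.abs(sensitivity)))
```

For a sequence model, stacking these sensitivity vectors over time gives exactly the kind of alignment picture described above.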
Explicit attention can be helpful because it limits the data presented to the network in some way: it can be more computationally efficient, more scalable (the network sees the same amount of input even when the input grows larger), it doesn't have to learn to ignore things, and it can also be used for sequential processing of static data (you can turn anything into a sequence, which comes full circle from when people used to turn sequential data into fixed-length vectors and summary statistics). "Hard" attention makes discrete selections and is not differentiable, so training it usually requires some kind of reinforcement learning technique, which is unavoidable in some settings. In other settings, however, it is possible to use "soft" attention, which is differentiable and so allows end-to-end training. One possibility is to have the attention mechanism output the parameters of a probability distribution over locations in the sequence, deciding where to attend (as in Graves' 2013 paper). This allows for visualization of the alignment between the produced output and the input data. Alternatively, you can select by content: the network outputs a "key vector" which is compared to all of the data using some similarity function, and the similarities are used to compute a probability distribution over the input that decides what to attend to. For example, the input can be embedded in some space, and the attention mechanism then decides which embeddings to attend to. This can be effective even when location-based attention would seem more natural, such as with "very sequential" data.
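The content-based soft attention described above can be sketched in a few lines (a generic illustration, not any specific paper's exact parameterization): compare the key vector to each input embedding with a similarity function, turn the scores into a probability distribution with a softmax, and read a weighted sum, all of which stays differentiable.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(key, memory):
    # Cosine similarity between the key vector and each memory row.
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )
    weights = softmax(sims)   # probability distribution over inputs
    read = weights @ memory   # soft, differentiable read
    return weights, read

# Three embedded inputs; the key matches the first one most closely.
memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
key = np.array([1.0, 0.0])
weights, read = attend(key, memory)
```

Because every step is a smooth function, gradients flow through `weights` back to whatever network produced the key, which is what makes end-to-end training possible.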
It's also useful to selectively attend to the network's internal state or memory, e.g. selectively writing rather than writing all the data at once and then deciding what to read. This may make it easier to build complex structures. In the Neural Turing Machine, a write is done by first emitting an erase vector and then emitting an add vector. This can be effective for copying tasks where the number of copies is variable, since it is important to keep track of how many copies have been produced. In a different example, the Neural Programmer-Interpreter avoids making the network learn from scratch how and what to attend to: it is told exactly what to attend to at each point in time. This allows training to happen faster, because the model is explicitly given the procedure it should use rather than just lots of training data.
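The erase-then-add write from the Neural Turing Machine can be sketched as follows (shapes simplified for illustration): the attention weights over memory rows gate both an erase vector, whose entries in [0, 1] say how much of each column to wipe, and an add vector that is then written in.

```python
import numpy as np

def ntm_write(memory, w, erase, add):
    """NTM-style selective write.

    memory : (N, M) memory matrix
    w      : (N,)   attention weights over the N memory rows
    erase  : (M,)   erase vector, entries in [0, 1]
    add    : (M,)   add vector
    """
    memory = memory * (1 - np.outer(w, erase))  # erase step
    memory = memory + np.outer(w, add)          # add step
    return memory

M = np.ones((3, 2))
w = np.array([1.0, 0.0, 0.0])   # attend fully to row 0
erase = np.array([1.0, 1.0])    # completely wipe the attended row
add = np.array([0.5, -0.5])
M2 = ntm_write(M, w, erase, add)
# Row 0 becomes [0.5, -0.5]; the unattended rows are untouched.
```

Because the erase and add steps are both smooth in `w`, `erase`, and `add`, the whole write remains differentiable, so the controller can learn where and what to write by gradient descent.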
In DRAW, a grid of Gaussian filters is used both to read from and write to the "canvas" image, which acts as a 2-dimensional memory. The centers of the Gaussians, their variance, and the "strength" of the focus are controlled over time. This allows images to be generated iteratively, for example by starting from a blurry image and progressively sharpening it. The resulting system is compositional: once it learns to generate one thing, it can generate an arbitrary number of them. Related are spatial transformer networks, which parameterize the sampling grid with an affine transformation (or a more general spline mapping), allowing nonlinear warps that can, for example, un-distort input images. This works in more than two dimensions, too.
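A DRAW-style read can be sketched like this (a simplified illustration with assumed parameter names, not the paper's full parameterization): an N x N grid of 1-D Gaussian filters along each image axis, with a controllable center, stride, and variance, extracts a soft glimpse from the canvas.

```python
import numpy as np

def gaussian_filterbank(center, stride, sigma, n, size):
    # n filter centers spaced `stride` apart around `center`
    mu = center + (np.arange(n) - n / 2 + 0.5) * stride
    coords = np.arange(size)
    F = np.exp(-((coords[None, :] - mu[:, None]) ** 2) / (2 * sigma ** 2))
    F /= F.sum(axis=1, keepdims=True) + 1e-8  # normalize each filter
    return F  # shape (n, size)

def draw_read(image, cx, cy, stride, sigma, n, gamma=1.0):
    # Apply the filter grid along each axis to get an (n, n) glimpse.
    Fy = gaussian_filterbank(cy, stride, sigma, n, image.shape[0])
    Fx = gaussian_filterbank(cx, stride, sigma, n, image.shape[1])
    return gamma * Fy @ image @ Fx.T

# A single bright pixel; centering the grid on it yields a focused glimpse.
img = np.zeros((8, 8))
img[2, 2] = 1.0
glimpse = draw_read(img, cx=2.0, cy=2.0, stride=1.0, sigma=0.5, n=3)
```

Moving `cx`/`cy` pans the window, `stride` zooms, and `sigma` blurs or sharpens; since all of these are smooth parameters emitted by the network, the whole read (and the analogous write) is trainable end-to-end.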