From Attention to Memory and Towards Long-Term Dependencies

Presenter | Yoshua Bengio |

Context | NIPS 2015 Reasoning, Attention, and Memory Workshop |

Date | 12/12/15 |

Treating machine translation as mapping a sentence to an intermediate representation that is fed to a generative model works reasonably well, but performance degrades on longer sentences. An attention mechanism lets the generative model select particular sequence steps as it generates each output: an additional small neural network computes a weighting over the sequence steps. This follows an earlier idea by Graves, who used a location-based mechanism to compute soft attention, and extends it to a content-based mechanism. Combinations of location and content have also been used more recently. For machine translation, this achieved state-of-the-art results for certain language pairs. It has also been applied (with exactly the same code) to “translate” from images to English, i.e. caption generation. With attention over images, it is simple to determine which part of the image the generator is “looking” at when it generates each word. The result is a “weakly” supervised method for learning what different objects are. Inspecting this output also makes debugging more straightforward.
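As a rough illustration of the content-based weighting described above, here is a minimal sketch in plain Python. It assumes a one-hidden-layer tanh scoring network; the names `attend`, `W_q`, `W_a` and `v`, and the toy numbers, are hypothetical and not taken from the talk.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(query, annotations, W_q, W_a, v):
    """Content-based soft attention (assumed scoring form, for illustration).

    query:       current generator state
    annotations: one vector per source sequence step
    W_q, W_a, v: parameters of the small scoring network (hypothetical names)
    Returns (weights over steps, weighted average of annotations).
    """
    q_proj = matvec(W_q, query)
    scores = []
    for a in annotations:
        a_proj = matvec(W_a, a)
        hidden = [math.tanh(qp + ap) for qp, ap in zip(q_proj, a_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * a[d] for w, a in zip(weights, annotations)) for d in range(dim)]
    return weights, context

# Toy example: three source steps, two-dimensional annotations.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(query, annotations, identity, identity, [1.0, 1.0])
print(weights, context)
```

The weights sum to one, so the context vector is a convex combination of the annotations. Plotting `weights` per generated word is what gives the "where is the generator looking" view mentioned above.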

One of the reasons attention mechanisms help is connected to the problem of long-term dependencies: when learning a composition of many nonlinear functions by measuring a loss that depends on all of them, the derivatives can become very small or very large depending on the eigenvalues of the Jacobians. If you want a recurrent network to store information reliably, you need some kind of attractor in the dynamics (Jacobians with eigenvalues of magnitude less than 1). The problem is that if the dynamics are contractive, the gradients also vanish. So the condition required for RNNs to be learnable (gradients that do not vanish) seems to imply that the network must forget things. One path to improving this was the LSTM, which introduces loops in the state-to-state transitions where the derivatives are slightly less than one. Alternative early approaches are skip connections and a hierarchy of timescales.
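The argument can be written out as a short sketch. Assume, for illustration only, a recurrence $h_k = f(h_{k-1}, x_k)$ with Jacobians $J_k = \partial h_k / \partial h_{k-1}$ and a loss $L$ that depends on the final state $h_T$ (this notation is not from the talk). The chain rule gives

$$
\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \, J_T \, J_{T-1} \cdots J_{t+1},
\qquad
\left\| \frac{\partial L}{\partial h_t} \right\| \le \left\| \frac{\partial L}{\partial h_T} \right\| \prod_{k=t+1}^{T} \| J_k \| .
$$

If every Jacobian satisfies $\| J_k \| \le \lambda < 1$, the bound becomes $\lambda^{T-t} \, \| \partial L / \partial h_T \|$, which shrinks exponentially in the time lag $T - t$. Strictly speaking, the bound is stated in terms of the spectral norm (largest singular value) rather than the eigenvalues, but the conclusion is the same: contractive dynamics that store information robustly also make the gradient from distant steps vanish.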

Consider a memory that can be read from and written to. At each time step, very little happens at most memory locations, because the softmax used for addressing usually selects only a few elements. This is similar to how an LSTM copies its memory across time. For this to work, however, the memory needs to be large enough that the network does not have to read from and write to the same locations very often. The idea of a “copy” (preserving information) can be generalized slightly to any operation whose Jacobian has eigenvalues of magnitude 1. All of this suggests that networks with large memories are valuable, but large memories make the networks more expensive.
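To make the "most locations are simply copied" point concrete, here is a minimal sketch of a soft write. It assumes a simple interpolation rule, where slot `i` becomes `(1 - w_i) * old + w_i * new`. This rule and the name `soft_write` are illustrative assumptions, not the mechanism of any specific model from the talk.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_write(memory, scores, new_value):
    """Blend new_value into each slot in proportion to its attention weight.

    A slot whose weight is near 0 is carried forward almost unchanged,
    i.e. its update is close to an identity (copy) operation.
    """
    weights = softmax(scores)
    updated = [
        [(1.0 - w) * old + w * new for old, new in zip(slot, new_value)]
        for w, slot in zip(weights, memory)
    ]
    return weights, updated

memory = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
scores = [0.0, 12.0, 0.0, 0.0]  # addressing sharply peaked on slot 1
weights, updated = soft_write(memory, scores, [0.0, 0.0])

# Largest absolute change per slot: tiny everywhere except the addressed slot.
change = [
    max(abs(u - o) for u, o in zip(new_slot, old_slot))
    for new_slot, old_slot in zip(updated, memory)
]
print(change)
```

Under this assumed rule, the derivative of an updated slot with respect to its old content is `(1 - w_i)` times the identity, so unaddressed slots pass gradients through almost unattenuated, much like the LSTM copy path.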

In unitary evolution recurrent networks, the idea of orthogonal/identity initialization is generalized to unitary matrices, parameterized so that no projection back onto the set of unitary matrices is needed. The resulting system can guarantee that the gradients will not explode. The construction exploits the fact that a product of unitary matrices is itself unitary. However, decomposing a unitary matrix into two matrices would result in $N^2$ operations for $N$ parameters. So the approach takes advantage of the fact that complex-valued matrices can be manipulated in the Fourier domain to speed up computation. Because the networks are complex-valued, the nonlinearity must preserve phase, so a ReLU applied only to the amplitude is used. On toy problems, the unitary RNNs learned a copy task at long time lags that other architectures could not. On the addition problem, they matched the performance of LSTMs. On the sequential and permuted MNIST tasks, the unitary network converged to a good solution quickly. Inspecting the gradients over time in the different architectures showed that the unitary RNN's gradients decayed more slowly. The norm of its hidden states also does not seem to saturate.
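The following sketch illustrates the two ingredients above: composing simple unitary factors, and a ReLU that acts only on the amplitude. It is a simplified stand-in, not the exact parameterization from the talk. The particular factors (diagonal phase matrices, one Householder reflection, a Fourier transform), the bias in `mod_relu`, and all names are assumptions made for illustration. The DFT is written naively in $N^2$ operations for readability; a real implementation would use an FFT.

```python
import cmath
import math
import random

def diag_phase(thetas, x):
    """Multiply by a diagonal unitary matrix with entries exp(i * theta)."""
    return [cmath.exp(1j * t) * xi for t, xi in zip(thetas, x)]

def reflect(v, x):
    """Householder reflection x - 2 v (v^H x) / (v^H v), which is unitary."""
    vhx = sum(vi.conjugate() * xi for vi, xi in zip(v, x))
    vhv = sum(abs(vi) ** 2 for vi in v)
    return [xi - 2.0 * vi * vhx / vhv for vi, xi in zip(v, x)]

def dft(x, inverse=False):
    """Unitary DFT (1/sqrt(n) scaling), naive O(n^2) version for clarity."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[col] * cmath.exp(sign * 2j * math.pi * row * col / n)
                    for col in range(n))
        for row in range(n)
    ]

def structured_unitary(params, x):
    """Apply D2 * F^{-1} * R * F * D1 to x; each factor is unitary."""
    x = diag_phase(params["theta1"], x)
    x = dft(x)
    x = reflect(params["v"], x)
    x = dft(x, inverse=True)
    return diag_phase(params["theta2"], x)

def mod_relu(z, bias):
    """ReLU on the amplitude |z| + bias; the phase of z is left unchanged."""
    mag = abs(z)
    if mag == 0.0 or mag + bias <= 0.0:
        return 0j
    return (mag + bias) * z / mag

def norm(x):
    return math.sqrt(sum(abs(xi) ** 2 for xi in x))

random.seed(0)
n = 8
params = {
    "theta1": [random.uniform(-math.pi, math.pi) for _ in range(n)],
    "theta2": [random.uniform(-math.pi, math.pi) for _ in range(n)],
    "v": [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)],
}
h = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
initial_norm = norm(h)
for _ in range(100):  # linear part of the recurrence only, no input or nonlinearity
    h = structured_unitary(params, h)
print(initial_norm, norm(h))
```

This composite uses only $3N$ real and complex parameters rather than a dense $N \times N$ matrix. After 100 applications the norm of `h` matches its initial value up to floating-point error, which is the property that keeps the linear part of the recurrence from shrinking or blowing up the state.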
