|Context||Columbia Neural Network Reading Group/Seminar Series|
Making incremental improvements on individual tasks (chess, object recognition) won't create an intelligent system. Instead, the tasks focused on should be more general - like learning entire algorithms, programs, mathematics, etc. If we were to deploy a robot in the real world, there's no way it would currently succeed because it would only be very good at a small number of tasks. If we focus first on more general, simple tasks, we can build up to a more general model of the world.
One example is learning to evaluate simple computer programs, i.e. ones with complexity on the order of the number of characters used to represent the program. We are essentially trying to learn the algorithm behind the interpreter (variable assignment, for loops, constants, etc) by feeding the program character-by-character into a model and then predicting the output. This particular task is a nice simple but general one because the model must model long-distance dependencies, be able to store things in memory, handle branching, and carry out multiple subtasks in the course of the full task. In this task, you don't need to produce an output character for each input character; we are simply producing a single output for the entire input sequence.
An RNN with LSTM was used with 2 layers, 400 units each, trained with stochastic gradient descent with a cross-entropy error. The input vocabulary is 42 characters; the output is 11 (all numerals and a dot to indicate end of sequence). LSTM was chosen because it can model long-range dependencies and store memory, but more importantly they can be trained much easier than a normal RNN. A conventional RNN has a hidden state which evolves through matrix multiplication (so the hidden state can change very rapidly); the LSTM has ways to store the hidden state so that it can be preserved. The model was trained until there was no improvement on a validation set. Teacher forcing was used so that each output character was always conditioned on the correct previous output character.
Only programs which can be solved with a single left-to-right pass, involving addition, subtraction, multiplication, variable assignment, if-statement and for-loops were used. The maximum operation nesting and constant length were varied to adjust the complexity of the program.
In curriculum learning, the model is first exposed to simple examples, then subsequently more and more complex ones, in hopes that it will make it easier to learn the more complex examples. In the naive strategy, smaller nesting and length programs were used, then each parameter was increased. In the mixed strategy, the complexity was randomly chosen for each example. In the combined strategy, either the naive strategy or the mixed strategy was chosen at random at each step.
Even without curriculum learning, the model was able to predict some programs exactly, but it often predicted leading and last digits correctly. The combined curriculum strategy tended to work the best; it was always better than using no curriculum learning. Using the naive/classical curriculum learning strategy actually made the accuracy a little worse than not using curriculum learning for simpler programs. For simply adding 9-digit numbers, the model could achieve 99% accuracy, but they were not able to generalize to longer sequences - they learned a new algorithm for each position in the sequence. Although the training, validation, and test datasets were disjoint, it's still not clear how much memorization was present.
Proving mathematical problems is potentially more difficult than learning to evaluate programs because the possible space of problems is much bigger. Once a proof is proposed, however, it can still be verified as correct. Finding the proof itself however requires a search over all possible combinations of operators, which is intractable for all but simple proofs. In some sense, mathematicians have learned the prior of proofs and use it to prove new identities, which prevents them from having to search over all possible operators. Mathematical identities can be expressed as trees of operations; when trees are equivalent some trees may be more or less efficient than others.
Given a grammar of operations and some target expression, we compose the operations in a tree form to compute the expression. This ork only focuses on matrix/vector elementwise) multiplication, transpose, and column/row sum/repetition are allowed. In this family, you can represent very complex functions. For example, the Taylor series expansion for some order of the RBM partition function can be converted from an exponential sum to a polynomial sum because of the constraints (binary) on the arguments. Representing the expressions purely symbolically was too slow, so instead random numbers were used for a fixed allowed matrix dimensionality. In this framework, the problem can be stated as finding an expression using only operators of lower complexity whose descriptor matches; to do so, a prior over the expressions is used. … The scorer can be naive - that is, just choose randomly from all valid operations; it can memorize small n-gram chunks which were useful in the past; or it can be learned as a neural network tree. A curriculum learning strategy was used, where the complexity of the example is its polynomial degree, so that the prior for higher polynomial degree was based on the prior from smaller ones. This makes it so that “chunks” are shared between solutions at different orders. 5 families of expressions were included, being multiply-sum, element-wise multiply-sum, symmetric polynomials, and RBM partition functions. The order of the expressions was increased, an explicit time cut-off was set, and for each order the experiment was repeated 10 times because of the random sampling. For some expression families, the 5-gram model was able to learn the correct inference technique (e.g. induction step). For the 2-sided RBM partition function, only a taylor expansion up to degree 5 was able to be found by the n-grams and naive strategy. The n-gram models however can only have a shallow understanding, i.e. up to n. As an alternative, a tree neural network was trained which is trained to understand more and more complex mathematical expressions. Through exhaustive search, a class of many examples equivalent examples was generated and fed into the tree neural network to learn equivalencies. Tree neural networks take as input trees as training data instead of sequences. By splitting these expressions into equivalence classes, the network was able with very high accuracy to predict whether two trees computed the same thing or not. This is without encoding any explicit mathematical knowledge - only equality of different operators. To build a prior, solutions were taken from lower degrees and fed into a softmax layer to compute scores for a higher degree.