|Presenter|Jeff Dean and Oriol Vinyals|
|Context|NIPS 2015 Tutorial|
Google Brain started as a project to push state-of-the-art results by scaling up the size of models and the amount of data they could be trained on. Initial experiments were on “perceptual tasks”; currently, research is broadly on scaling up experiments. Initially, in 2012, there wasn't a lot of deep learning research at Google (mostly unsupervised learning), but adoption has grown substantially and now teams across Google use it (and the distributed systems Google Brain is developing). Part of this is thanks to the fact that the systems/techniques developed are flexible with respect to their input and objective. These “universal” approaches also tend to work better than other techniques. Most of the success is in supervised approaches, where a large amount of raw data is fed into the model and the model's parameters are optimized to, for example, classify images. This approach also deals very naturally with sequential data, via recurrent neural networks. With smaller datasets, pre-training on a larger supervised task can help. A benefit of the flexibility of these approaches is that they are useful across domains, so they form a sort of common language. To motivate the need for good tools for this research: it is commonly observed that as you scale up the size of the training data and the size of the model (independently or together), the results generally get better.
The first system developed for these large-scale experiments was DistBelief, a very scalable system which was effective for production but not flexible for research - it was mainly targeted at deep feedforward networks trained with SGD. The second (and current) generation is TensorFlow, which is much more flexible for different research tasks, more portable across platforms, and open source so anyone can use it. These tools are meant to make it possible to build large systems and train them quickly - such experiments require a lot of computation, as well as flexibility with respect to experiments.
One of the earliest experiments done by Google Brain was a very large-scale unsupervised learning experiment on YouTube images (millions of them) using an autoencoder for reconstruction; the system was able to find templates for common images with no supervision. Then, they focused on acoustic modeling for speech, first with deep feedforward networks and then with stacks of convolutional, recurrent (LSTM), and deep feedforward layers. Extending on that, people are now interested in replacing the entire speech pipeline (including the language model, not just the acoustic model), which allows for end-to-end training and systems which are easier to express. Also early on, after the 2012 ImageNet competition, convolutional networks for image recognition/processing became a focus. Interestingly, Google's improvements to these systems (Inception/GoogLeNet) have far fewer parameters because they remove fully connected layers. One motivation for these smaller models is that they can more easily be run on smaller systems (e.g. mobile phones). Inception v3 was just released as a pre-trained model for TensorFlow, so it can be used to rapidly make predictions about images, even on local machines.
To do novel research, it's important for it to be easy to express all of the new models and training techniques people are interested in implementing. But the experiments should still be scalable, and the result should still be portable (runnable on a wide variety of systems). It should also be freely accessible, so that results are reproducible. These issues were not all addressed by DistBelief, so Google started developing TensorFlow. Google released the initial open-source version of TensorFlow on November 9th, with a great deal of documentation.
The core of TensorFlow is an architecture for running computational graphs on a variety of platforms (CPU, GPU, computer clusters, different operating systems…). Models are expressed as TensorFlow graphs, and TensorFlow graphs are (at this time) specified using Python or C++ (though new front-ends are in development). A computational graph is a dataflow graph, consisting of variables and operations as nodes. Some operation nodes can compute gradients. TensorFlow is designed so that portions of the graph can be split out onto different devices. Whenever an edge crosses a “boundary” between devices, the edge is replaced with a send node and a receive node. The send and receive node implementations capture all of the data transfer in the system (GPU copies, cross-machine RPC, RDMA between GPUs on different machines, etc.). The core system defines a number of standard operations, but it's easy to define new operations and/or kernels. The other main interface in TensorFlow is the session interface, which allows parts of the graph to be executed (optionally feeding in certain values as input and fetching others as output). Typically, building the graph is slower and more “contemplative” so as to produce an efficient graph, while running it happens thousands or millions of times, so execution is designed to be as efficient as possible. A client process communicates with a master process, which communicates with workers to execute portions of the graph. A run call can specify that certain variables are given assumed values and that only some outputs are requested - in this case, only the subgraph needed to produce those outputs is executed.
In this example, the eigenvector of a matrix is computed using the power method. First, an interactive TensorFlow session is created with an empty graph. Variables are then defined; a value which must be fed in for the graph to be executed is defined as a placeholder. Then, the graph is built up using TensorFlow operations, e.g. x = tf.matmul(W, x). The graph can be given logging operations. Then, the graph is run, supplying values for the placeholders. The graph can be visualized using “TensorBoard”, which shows operations, variables, etc. in a graph structure. TensorBoard can also collapse/expand portions of the graph for ease of visualization. To compute gradients, tf.gradients can be called with arguments specifying which computed value to differentiate and which variables to differentiate it with respect to. Gradient descent can then also be added as part of the graph, or the gradient descent can happen between run calls, outside of TensorFlow. Gradients get replaced by their symbolic expressions. These expressions are added wholesale, but only the gradients which are required are actually computed when running. The execution engine also does some optimizations “under the covers” to replace/merge parts of the graph.
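The power method itself is simple; as a point of reference, here is a minimal NumPy sketch of the computation the talk's example expresses as a TensorFlow graph (the matrix `W` and the step count are made-up illustrations, not from the talk):

```python
import numpy as np

def power_method(W, num_steps=100):
    """Power iteration: repeatedly apply W and renormalize; x converges
    to the dominant eigenvector of W. This is the same computation the
    TensorFlow example builds as a graph, with x fed as a placeholder
    and x = tf.matmul(W, x) as the update."""
    x = np.ones((W.shape[0], 1))
    for _ in range(num_steps):
        x = W @ x
        x = x / np.linalg.norm(x)  # renormalize so x doesn't overflow
    return x

# Illustrative matrix whose dominant eigenvector is [1, 0] (eigenvalue 2)
W = np.array([[2.0, 0.0],
              [0.0, 1.0]])
v = power_method(W)
```

In the TensorFlow version, the loop body is a subgraph and each iteration is a `session.run` call feeding the previous `x` back in.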
When initially released, the single-core version of TensorFlow was substantially slower than other libraries (e.g. Torch), partially because some indexing operations compiled to slow code and partially because TensorFlow currently uses cuDNN v2. Since then, some of the compilation issues have been fixed, so its speed now matches that of other libraries which use cuDNN v2 (which is still substantially slower than cuDNN v3).
More of Google's focus is on large-scale distributed performance, which can yield a much larger speed improvement than tuning single-machine performance. Running on a very large cluster can make experiments which would typically take days take hours instead, which facilitates faster prototyping and research. Training is where most of the time in a machine learning pipeline is spent, so the best way to decrease training time is to decrease the step time. The main problem with distributing work is doing it in a way where communication doesn't kill you - single-core parallelism is free (SIMD), cross-core is cheap (inter-socket bandwidth), cross-device can be expensive (PCIe bandwidth), and cross-machine can be very expensive (network bandwidth).
One way to make things parallel is to partition independent parts of the model, e.g. in a convolutional network, slicing the convolution across different devices. This can be the most effective way of getting parallel training. The other approach is data parallelism - using multiple model replicas to process different examples at the same time, then merging the results of all of the replicas. Each replica retrieves parameters from a centralized parameter server, computes its parameter updates, and then reports the updates back to the parameter server. Done “synchronously” (all gradient updates are applied at the same time), using N replicas is equivalent to using an N-times-larger batch size, which prevents gradient staleness but requires recovery if any single machine fails. Done “asynchronously”, gradients may become “stale”, but this may not matter in practice. The speedup attained can depend highly on the kind of model. Ideally, you want the model computation time to be large relative to the time spent sending/receiving parameters. Models tend to re-use parameters multiple times when computing their output, which helps: convolutional models re-use parameters at different spatial positions, and recurrent models re-use parameters at each time step. This means less data has to be transferred relative to how much computation must happen. Data parallelism is very important for very large models with large datasets, and can produce much better results much faster.
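The claim that N synchronous replicas behave like an N-times-larger batch can be checked directly. Here is a toy NumPy sketch (the linear least-squares model, learning rate, and shard count are made up for illustration; real replicas would run on separate machines against a parameter server):

```python
import numpy as np

# Toy synchronous data parallelism: 4 "replicas" each compute the gradient
# on their own equal-size shard; averaging those gradients is mathematically
# the same as one step over the combined (4x larger) batch.
rng = np.random.default_rng(0)
w_true = np.array([3.0, -2.0])
X = rng.normal(size=(64, 2))
y = X @ w_true

def grad(w, Xb, yb):
    # gradient of the mean squared error of the linear model Xb @ w
    return Xb.T @ (Xb @ w - yb) / len(yb)

w_sync = np.zeros(2)
w_big = np.zeros(2)
shards = np.split(np.arange(64), 4)  # 4 replicas, 16 examples each
for _ in range(200):
    # "parameter server" averages the per-replica gradients
    g = np.mean([grad(w_sync, X[s], y[s]) for s in shards], axis=0)
    w_sync = w_sync - 0.1 * g
    # equivalent single step over the full 64-example batch
    w_big = w_big - 0.1 * grad(w_big, X, y)
```

The two trajectories coincide, which is why synchronous training avoids gradient staleness at the cost of waiting for the slowest replica.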
TensorFlow is designed to be able to express either model or data parallelism very easily. It needs to decide which device is going to execute each node in a graph. The heuristics used are not perfect, so the user can supply device constraints (hints as to where each computation should take place). TensorFlow then uses the hints and a cost model (node execution time and tensor size estimates) to decide where to run each computation. TensorBoard can be used to visualize where different computations are happening. As an example, in a convnet used for CIFAR10 on two GPUs and a CPU, two convolutional “towers” are computed, one on each GPU, and then the cost and parameters are stored/computed on the CPU.
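As a toy illustration of the hints-plus-cost-model idea (this is not TensorFlow's actual placement algorithm; the node names, costs, and the greedy least-loaded rule are made up):

```python
def place(nodes, devices, hints=None, cost=None):
    """Assign each graph node to a device: a user hint always wins;
    otherwise pick the device with the least estimated load so far."""
    hints = hints or {}
    cost = cost or {}
    load = {d: 0.0 for d in devices}  # estimated work per device
    placement = {}
    for n in nodes:
        d = hints.get(n) or min(load, key=load.get)
        placement[n] = d
        load[d] += cost.get(n, 1.0)   # default unit cost if no estimate
    return placement

# Two expensive conv towers spread across GPUs; loss pinned to the CPU,
# mirroring the CIFAR10 example above.
p = place(['conv1', 'conv2', 'loss'], ['gpu:0', 'gpu:1', 'cpu:0'],
          hints={'loss': 'cpu:0'}, cost={'conv1': 5.0, 'conv2': 5.0})
```

The real placer also accounts for tensor sizes on edges (transfer cost), not just per-node compute estimates.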
LSTMs are powerful recurrent models, but their update equations are complicated (multiple “gates” computed at each time step), so implementing their gradients (backpropagated through time) by hand can be prohibitively expensive. By using a computational graph, the code and the math equations can have essentially a one-to-one mapping. When the gradients of individual operations are known by the graph, gradients of the whole model don't have to be coded by hand. One valuable application of LSTMs is sequence-to-sequence learning, which allows sequences to be transduced (e.g. machine translation) in an end-to-end trainable system. The fact that LSTMs can store “memory” over time helps with this - the input sequence is encoded as a fixed-length vector representation, which is then used as the initial state of a decoder. One example of a sequence-to-sequence transduction task is image captioning, where the encoding is done using a convolutional network instead of a recurrent model. Using a system where gradients can be computed automatically makes prototyping these architectures much simpler. LSTMs can also be used for sequence prediction (e.g. predicting the next line of a movie dialogue); research at Google on this task led to the Smart Reply system. Deep LSTMs are important for achieving good results on this task, and model parallelism can help achieve speedups by putting different layers on different GPUs. Splitting layers across GPUs is very simple in TensorFlow - you just give it per-layer hints as to which device to use.
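To illustrate the “one-to-one mapping” point, a single LSTM step can be written so that each line mirrors one gate equation. This NumPy sketch uses made-up weight shapes and omits biases for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W):
    """One LSTM time step; each line mirrors one update equation.
    W holds one weight matrix per gate (biases omitted)."""
    z = np.concatenate([x, h])   # concatenated input and previous state
    i = sigmoid(W['i'] @ z)      # input gate
    f = sigmoid(W['f'] @ z)      # forget gate
    o = sigmoid(W['o'] @ z)      # output gate
    g = np.tanh(W['g'] @ z)      # candidate cell update
    c_new = f * c + i * g        # new cell state (the "memory")
    h_new = o * np.tanh(c_new)   # new hidden state / output
    return h_new, c_new

# Made-up sizes: 2-dim input, 3-dim state (zero weights, just to run it)
W = {k: np.zeros((3, 5)) for k in 'ifog'}
h, c = lstm_step(np.ones(2), np.zeros(3), np.ones(3), W)
```

In a graph framework, each of these operations has a known gradient, so backpropagation through time over many such steps comes for free.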
In order to turn a non-parallel system into a data-parallel one, a few changes must be made to the code. First, tf.device must be initialized with a tf.ReplicaDeviceSetter, and a tf.Supervisor must be used, which keeps track of the number of steps. This creates separate parameter devices, from which parameters are requested. To speed up transfers across devices, a simple graph transformation is to use floating-point encodings with fewer bits. Optimizations can also be made for very sparse data which must be transferred. After training, precision matters even less for inference, so even lower precision can be used in production on lower-powered devices.
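A minimal NumPy sketch of the fewer-bits idea, using float16 as a stand-in low-precision encoding (the actual scheme used in the talk's transformation is not specified in these notes):

```python
import numpy as np

# Cast parameters to a lower-precision float for the cross-device hop,
# then cast back on the receiving side: half the bytes on the wire,
# at the cost of a small quantization error.
params = np.random.default_rng(0).normal(size=1000).astype(np.float32)
wire = params.astype(np.float16)      # what actually gets transferred
received = wire.astype(np.float32)    # restored (lossily) on arrival
max_err = float(np.max(np.abs(params - received)))
```

Because gradients are noisy anyway, training tolerates this quantization on parameter transfers, and inference tolerates even coarser encodings.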