{{tag>"Talk Summaries" "Neural Networks" "NIPS 2015"}}
====== How the Hell do we do Unsupervised Learning? ======
| Presenter | Yann LeCun |
| Context | NIPS 2015 Workshop on Statistical Methods for Understanding Neural Systems |
| Date | 12/11/15 |
The recent success of deep learning is mostly due to supervised learning, which requires very large numbers of labeled examples.  The kind of learning humans and animals do is mostly unsupervised: we discover the constraints of the physical world mostly by looking around and moving around.  It would be great to build a model which learns that the world is three-dimensional, that certain objects move, that unsupported objects fall, that two objects can't be in the same place at the same time, etc., all of which humans learn without labels.  This is analogous to how most of physics is still unknown: most of learning is unsupervised.
===== Integrating Supervised and Unsupervised Learning with Stacked What-Where Autoencoders =====
Boltzmann machines have the nice property of combining supervised with unsupervised learning, and generative with discriminative learning.  You want a system which is trained to learn the distribution of the input.  One way to do this is to use a transformation which compresses, but this can be problematic because inverting the compression may require mapping one code to many inputs.  One option is to sample from the distribution rather than producing a one-to-many mapping.  An alternative is to keep the complementary information that is dropped during compression, so that during reconstruction it can be re-combined with the code to reconstruct the input.  This has been done by Ranzato, Gregor, and Hinton.  One instantiation is a convolutional network with pooling and unpooling, where hard max-pooling (which discards information) is replaced by "soft" max-pooling and soft unpooling.  Pooling applies a softmax over each pooling window: the softmax weights multiplied against the window give the pooled value (the "what"), and a soft argmax over the same weights gives the location (the "where").  Unpooling then "paints" the pooled value back at that location, and the whole operation is completely differentiable.  The unpooling generates sparse representations, so the encoder produces a sparse representation which must be reconstructable.  The whole system can be stacked and trained all at once, with a reconstruction loss at every layer and potentially a classification loss at the final layer.  The resulting system reconstructs inputs after compression much more effectively than plain upsampling, and is analogous to some neural models; it is also related to Hinton's "capsule" idea.  It produces good results on limited-training-size MNIST, though not as good as the Ladder Network, and achieves almost state-of-the-art results on STL-10, CIFAR-10, and CIFAR-100.
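The soft pooling/unpooling step can be sketched in NumPy for a single 1-D pooling window. This is only an illustration of the idea, not the talk's implementation: the temperature ''beta'' and the function names are assumptions.

```python
import numpy as np

# "Soft" max-pooling: a softmax over the pooling window replaces the hard
# max, so both the pooled value ("what") and the location ("where") are
# differentiable. `beta` (assumed here) controls how close this is to a
# hard max/argmax.

def soft_pool(window, beta=10.0):
    w = np.exp(beta * window)
    w /= w.sum()                  # softmax weights over the pooling window
    what = float(w @ window)      # pooled value; ~= max(window) for large beta
    return what, w                # w doubles as a soft location ("where")

def soft_unpool(what, where):
    # "paint" the pooled value back at its soft location
    return what * where

window = np.array([0.1, 0.2, 2.0, 0.3])
what, where = soft_pool(window)
recon = soft_unpool(what, where)
```

With a large ''beta'' the reconstruction concentrates almost all of the pooled value at the position of the original maximum, which is what lets the decoder re-place features where the encoder found them.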
===== Unsupervised Learning Through Temporal Prediction =====
There is a lot of useful information in video.  In this system, two shallow sparse (L1-penalized) ReLU autoencoders look at two successive frames of a video and try to be invariant to the transformations that occur between them.  By grouping filters together, very nice filters can be learned, but using them to improve a classification task has not been tried.  Ideally, some changes in the input space should correspond to linear changes in the feature space, to allow for interpolation.  Training is on three successive frames of video, where the third frame is predicted from a linear extrapolation of a learned (unsupervised) representation of the first two frames.  Again, all pooling is the same "soft" pooling and unpooling.  The resulting filters also look nice, and can be used to predict rotations of simple images via simple linear interpolation - with a single-layer system!
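The extrapolation step itself is simple: predict the third frame's code as a linear continuation of the first two codes. A minimal sketch, using a random linear map as a stand-in encoder (the real system learns sparse ReLU autoencoder features; everything here is illustrative):

```python
import numpy as np

# Linear extrapolation in feature space: z3_hat = 2*z2 - z1, i.e. continue
# the straight line through the codes of frames 1 and 2. Encoder weights W
# are random stand-ins, not learned features.

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)) * 0.1   # hypothetical encoder weights

def encode(frame):
    return W @ frame

# Three successive "frames" undergoing a constant per-frame change v.
f1 = rng.standard_normal(16)
v = rng.standard_normal(16) * 0.05
f2, f3 = f1 + v, f1 + 2 * v

z1, z2 = encode(f1), encode(f2)
z3_hat = 2 * z2 - z1                     # predicted code of the third frame
```

For a linear encoder and constant motion the extrapolated code matches ''encode(f3)'' exactly; the point of the training objective is to shape a nonlinear feature space so that real transformations (translations, rotations) become approximately linear like this.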
===== Video Prediction with a Spatial-Temporal ConvNet and Adversarial Criterion =====
Prediction of video, text, speech, etc. is a good unsupervised criterion, because being able to predict what happens next is fundamental to intelligence, and teaching machines to predict may help them achieve it.  However, the world is intrinsically unpredictable, and being able to predict all possible futures is infeasible.  One possible solution is to use latent/hidden variables (the stuff you can't observe which makes the world unpredictable).  Another is to use a criterion which ignores the differences among plausible outcomes - errors that stay within the set of things which could actually happen.  This latter trick is encapsulated in adversarial training, which allows for better predictions of the future.  A discriminator is trained to distinguish real-world data from the predictions of the generative model; the predictor can "cheat" by getting the gradient of the discriminator, which makes it better at fooling the discriminator.  In addition, a new gradient-based criterion can be used which focuses on the edges in images instead of raw pixel differences.
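The talk only names a gradient criterion focused on edges; one plausible form is a gradient difference loss, sketched below in NumPy. The function name and exact formula are assumptions, not the talk's definition.

```python
import numpy as np

# A gradient difference loss: compare the magnitudes of spatial gradients
# (edges) of the prediction and the target, instead of raw pixel values.
# This penalizes blurry predictions whose edges are washed out.

def gradient_difference_loss(pred, target):
    dx_p, dx_t = np.diff(pred, axis=1), np.diff(target, axis=1)
    dy_p, dy_t = np.diff(pred, axis=0), np.diff(target, axis=0)
    return (np.abs(np.abs(dx_p) - np.abs(dx_t)).sum()
            + np.abs(np.abs(dy_p) - np.abs(dy_t)).sum())

# A sharp vertical edge vs. a flat "blurry" prediction with the same mean:
target = np.zeros((4, 4)); target[:, 2:] = 1.0
blurry = np.full((4, 4), 0.5)
loss_sharp = gradient_difference_loss(target, target)
loss_blurry = gradient_difference_loss(blurry, target)
```

The flat prediction has small per-pixel error but zero gradients, so this criterion penalizes it, whereas a plain L2 loss often prefers exactly such averaged, blurry futures.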
how_the_hell_do_we_do_unsupervised_learning.txt · Last modified: 2015/12/17 21:59 (external edit)