Adaptive, Articulate, and Actionable Deep Learning

Presenter: Trevor Darrell
Context: NIPS 2015 Deep Learning Symposium
Date: 12/10/15

Deep learning has proven to be a flexible framework that can adapt across visual tasks, articulate descriptions in natural language, and learn action policies from varied inputs.

Domain Adaptation

We want to train in one environment (e.g., images from the web) and test in another (e.g., real-world images from a robot's camera). Deep representations do tend to be more invariant to new domains, but several techniques can make them adapt better. The most common and simplest is to fine-tune the model on data from the test domain. However, the test problem sometimes offers very few labeled examples, so ideally we would like a model that explicitly generalizes across domains. This can be done by adding losses that penalize domain specificity, i.e., penalizing models whose features can discriminate between domains; this requires no labels at all in the test domain. If you do have labels in the test setting, a “dark knowledge” approach can take advantage of correlations in the source domain by applying them through a soft-label loss in the target domain. Combining these two approaches outperforms simply pooling all the data into one training set.
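As a rough illustration of the domain-confusion idea (a sketch of the general technique, not necessarily the speaker's exact method), the snippet below assumes PyTorch and uses a shared feature extractor, a gradient-reversal layer, and a small domain classifier; all layer sizes and names are hypothetical:

  import torch
  import torch.nn as nn

  class GradReverse(torch.autograd.Function):
      # Identity in the forward pass; negated, scaled gradient in the
      # backward pass, so the encoder learns to *confuse* the domain head.
      @staticmethod
      def forward(ctx, x, lambd):
          ctx.lambd = lambd
          return x.view_as(x)

      @staticmethod
      def backward(ctx, grad_output):
          return -ctx.lambd * grad_output, None

  features = nn.Sequential(nn.Linear(256, 128), nn.ReLU())  # shared encoder
  task_head = nn.Linear(128, 10)                            # source-label classifier
  domain_head = nn.Linear(128, 2)                           # source vs. target

  def losses(x_src, y_src, x_tgt, lambd=0.1):
      f_src, f_tgt = features(x_src), features(x_tgt)
      # Standard supervised loss on labeled source data.
      task_loss = nn.functional.cross_entropy(task_head(f_src), y_src)
      # Domain loss on both domains; gradient reversal penalizes
      # domain-specific features without needing any target labels.
      f_all = torch.cat([f_src, f_tgt])
      d_labels = torch.cat([torch.zeros(len(x_src), dtype=torch.long),
                            torch.ones(len(x_tgt), dtype=torch.long)])
      domain_loss = nn.functional.cross_entropy(
          domain_head(GradReverse.apply(f_all, lambd)), d_labels)
      return task_loss + domain_loss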

Describing Images

There have been many models and advances in the last year in which a CNN processes an image and its representation is fed into an RNN language model to produce a sequence of words describing the image. Most of these models are limited in that they can only “talk” about things they have seen in their paired training data: novel objects get named as their closest analog in the training set, even when that label is not quite right. However, there are many labeled-but-uncaptioned images (e.g., from ImageNet) and text descriptions of different things, which can be leveraged by training the visual and language components separately on different datasets. Integrating knowledge from these separate label collections results in models with a larger object/description vocabulary.
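To make the generic CNN-to-RNN architecture concrete, here is a minimal sketch assuming PyTorch; the layer sizes and names are illustrative, not the speaker's model. The image features initialize an LSTM decoder that is trained with teacher forcing:

  import torch
  import torch.nn as nn

  class Captioner(nn.Module):
      def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden_dim=256):
          super().__init__()
          # A real system would use a pretrained CNN; a linear layer
          # stands in for the image encoder here.
          self.encode = nn.Linear(feat_dim, hidden_dim)
          self.embed = nn.Embedding(vocab_size, embed_dim)
          self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
          self.out = nn.Linear(hidden_dim, vocab_size)

      def forward(self, image_feats, captions):
          # Image features initialize the LSTM's hidden state.
          h0 = torch.tanh(self.encode(image_feats)).unsqueeze(0)
          c0 = torch.zeros_like(h0)
          # Teacher forcing: condition each step on the previous gold word.
          hidden, _ = self.lstm(self.embed(captions), (h0, c0))
          return self.out(hidden)  # per-step scores over the vocabulary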

In a related problem, compositionality (the ability to break processing into separate, reusable structures) is important for visual question answering. To integrate this into typical VQA systems, a linguistic parser can be added so that the model assembles processing modules matching the question's structure; compositional questions of this kind are difficult for existing monolithic models to solve. This approach also provides gains on traditional VQA datasets. Natural language can likewise be used to make queries about regions of an image, by encoding additional information such as local features, spatial configurations, and global context.
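As a toy illustration of the module idea (loosely in the spirit of composing small networks from a question parse, not the exact system described in the talk), the sketch below assumes PyTorch; the modules, layouts, and dimensions are all hypothetical:

  import torch
  import torch.nn as nn

  class Attend(nn.Module):
      # Produces a spatial attention map conditioned on a word embedding.
      def __init__(self, feat_dim=64, word_dim=32):
          super().__init__()
          self.proj = nn.Linear(word_dim, feat_dim)

      def forward(self, image_feats, word_vec):
          # image_feats: (regions, feat_dim); word_vec: (word_dim,)
          scores = image_feats @ self.proj(word_vec)
          return torch.softmax(scores, dim=0)

  class Classify(nn.Module):
      # Answers from the attention-weighted image representation.
      def __init__(self, feat_dim=64, num_answers=100):
          super().__init__()
          self.out = nn.Linear(feat_dim, num_answers)

      def forward(self, image_feats, attention):
          pooled = (attention.unsqueeze(1) * image_feats).sum(dim=0)
          return self.out(pooled)

  # A parser would map "what color is the dog?" to a layout like
  # classify(attend[dog]); the layout is hard-coded here for illustration.
  attend, classify = Attend(), Classify()
  image_feats = torch.randn(49, 64)   # e.g., a 7x7 grid of region features
  dog_vec = torch.randn(32)           # embedding of the word "dog"
  answer_scores = classify(image_feats, attend(image_feats, dog_vec))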
