 +{{tag>"​Talk Summaries"​ "​Neural Networks"​ "NIPS 2015"​}}
  
 +====== Adaptive, Articulate, and Actionable Deep Learning ======
 +
 +| Presenter | Trevor Darrell |
 +| Context | NIPS 2015 Deep Learning Symposium |
 +| Date | 12/10/15 |
 +
Deep learning has proven to be a flexible approach: it can adapt across visual domains, articulate language to describe what it sees, and learn action policies from a variety of inputs.

===== Domain Adaptation =====
We want to train in one setting (e.g. images collected from the web) and test in another (e.g. real-world images from a robot's camera).  Deep features do tend to be more invariant to new domains than hand-engineered ones, but several things can be done to make models adapt better.  The most common and simplest is to fine-tune the model on labeled data from the test domain.  However, the test problem often has few or no labeled examples, so ideally we want a model that explicitly generalizes across domains.  This can be done by adding a loss that penalizes domain-specific representations, i.e. features from which a classifier can tell which domain an example came from, and it requires no labels at all in the test domain.  If some labels are available in the test setting, a "dark knowledge" approach can transfer the correlations between categories learned in the source domain through a soft-label loss in the target domain.  Combining these two approaches outperforms simply pooling all the data into one training set.
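One common way to realize such a domain-penalizing loss is a gradient-reversal layer in front of a small domain classifier. The following is a minimal PyTorch sketch of that idea rather than the exact formulation from the talk; the layer sizes, the loss weight ''lam'', and the temperature ''T'' are assumptions.

<code python>
# A domain-confusion loss via gradient reversal: the feature extractor is trained
# to fool a domain classifier, so its features stop encoding the domain.
# Sizes, lam, and T are illustrative assumptions, not values from the talk.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # flip the gradient flowing into the features

features = nn.Sequential(nn.Linear(4096, 256), nn.ReLU())  # on top of CNN activations
classifier = nn.Linear(256, 31)   # task labels (supervised in the source domain)
domain_clf = nn.Linear(256, 2)    # predicts source vs. target

def total_loss(x_src, y_src, x_tgt, lam=0.1, T=2.0, soft_tgt=None):
    f_src, f_tgt = features(x_src), features(x_tgt)
    task = F.cross_entropy(classifier(f_src), y_src)

    # Domain-confusion term: the domain classifier sees reversed gradients,
    # so minimizing this loss pushes features toward domain invariance.
    f_all = torch.cat([GradReverse.apply(f_src, lam), GradReverse.apply(f_tgt, lam)])
    d_lbl = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))]).long()
    confusion = F.cross_entropy(domain_clf(f_all), d_lbl)

    # Optional "dark knowledge" term: match softened (temperature T) source-model
    # predictions soft_tgt on target examples instead of using only hard labels.
    soft = 0.0
    if soft_tgt is not None:
        soft = F.kl_div(F.log_softmax(classifier(f_tgt) / T, dim=1),
                        soft_tgt, reduction="batchmean")
    return task + confusion + soft
</code>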
===== Describing Images =====
Over the last year there have been many models in which a CNN processes an image and its representation is fed into an RNN language model that produces a sequence of words describing the image.  Most of these models are limited in that they can only "talk" about things they have seen in their paired training captions: a novel object gets named after its closest analog in the training data, even when it is not actually that object.  However, there are many labeled but uncaptioned images (e.g. from ImageNet) and large text corpora describing other things, and these can be leveraged by training the visual and language components separately on different datasets.  Integrating knowledge from these different label collections yields models with a much larger object and description vocabulary.
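A bare-bones sketch of this CNN-to-RNN pattern (PyTorch; the dimensions, vocabulary size, and single-layer LSTM are assumptions rather than details from the talk):

<code python>
# A minimal CNN -> RNN captioner: CNN features seed an LSTM that predicts
# one word per step. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # project CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # scores over the vocabulary

    def forward(self, cnn_feats, captions):
        # Treat the image as the first "token", then unroll over the caption words.
        img = self.img_proj(cnn_feats).unsqueeze(1)            # (B, 1, E)
        words = self.embed(captions)                           # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))  # (B, T+1, H)
        return self.out(hidden)                                # per-step word logits
</code>

Training minimizes cross-entropy between these per-step logits and the next ground-truth word; the vocabulary-expansion idea then roughly amounts to adding rows to ''embed'' and ''out'' whose parameters come from components trained separately on unpaired image labels and text.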
In a related problem, compositionality (the ability to break a question into separate processing steps) is important for visual question answering.  A language parser can be added to these models to decompose each question into a layout of reusable modules, so that compositional questions, which are difficult for existing monolithic models, can be answered.  This also provides gains on traditional VQA datasets.  Natural language can likewise be used to query for objects in an image, by encoding additional information such as local features, spatial configuration, and global context.
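A toy sketch of the module idea follows (the module names, sizes, and hand-wired layout are assumptions; in a real system the parser would choose the layout per question):

<code python>
# Two toy modules in the spirit of compositional VQA: "Find" attends to the
# regions named by a word, "Describe" answers from the attended features.
# A parser would map "what color is the dog?" to a layout like
# Describe(Find("dog")); here the layout would be wired by hand for illustration.
import torch
import torch.nn as nn

class Find(nn.Module):
    """Attention over image regions for a single query word."""
    def __init__(self, feat_dim=512, embed_dim=128, vocab_size=1000):
        super().__init__()
        self.word = nn.Embedding(vocab_size, embed_dim)
        self.score = nn.Linear(feat_dim + embed_dim, 1)

    def forward(self, regions, word_idx):                     # (B, R, D), (B,)
        w = self.word(word_idx).unsqueeze(1).expand(-1, regions.size(1), -1)
        s = self.score(torch.cat([regions, w], dim=-1)).squeeze(-1)
        return torch.softmax(s, dim=-1)                        # (B, R) attention map

class Describe(nn.Module):
    """Predict an answer from attention-weighted region features."""
    def __init__(self, feat_dim=512, n_answers=2000):
        super().__init__()
        self.out = nn.Linear(feat_dim, n_answers)

    def forward(self, regions, attention):                     # (B, R, D), (B, R)
        attended = (attention.unsqueeze(-1) * regions).sum(dim=1)
        return self.out(attended)                               # (B, n_answers) logits
</code>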