|Context||NIPS 2015 Invited Talk|
A model describes data that one could observe from a system, and we measure how good it is on that basis. We can use the mathematics of probability theory to express forms of uncertainty about our model, which allows us to use Bayes rule as a way of doing inference about hypotheses from data. Learning and prediction can then be seen as forms of inference. The beauty is that everything follows from two simple rules of probability theory: the sum rule $P(x) = \sum_y P(x, y)$ and the product rule $P(x, y) = P(x)P(y | x)$. Learning is then computing the posterior probability of a model's parameters given the data and the model: by Bayes rule, the posterior is proportional to the likelihood of the parameters under the model times the prior probability of the parameters, normalized by the evidence (the probability of the data under the model).

We also need a way to compare models. For example, given some data we can fit polynomials of various orders - some may underfit the data for lack of parameters, and some may overfit by fitting too many parameters. The same sort of comparisons can be made between different clustering methods, dimensionality reduction methods, etc. To do this in the Bayesian framework, we can compare model classes by computing the posterior probability of the model given the data using the sum and product rule; the key quantity is the evidence, the probability of the data under the model averaging over all possible parameter values. Models of different complexities place different distributions over the same space of datasets. If we observe a particular dataset, a model that is too simple may assign it low probability, while a model that is too complex spreads its probability mass over many possible datasets and so also assigns the observed one low probability - an automatic Occam's razor. In some cases it's straightforward to compute the evidence, but for more complicated models it can be very expensive. Fortunately, there's a very well-developed set of approximations for computing these integrals, including the Laplace approximation, the Bayesian information criterion, variational approximations, etc.
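As a sketch of the polynomial-comparison example: for a linear-in-parameters model with a Gaussian prior on the weights and Gaussian noise, the evidence is available in closed form, so the Occam's razor effect can be seen directly. The dataset, prior variance `alpha`, and noise level below are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a quadratic with Gaussian noise (illustrative choices).
x = np.linspace(-1, 1, 30)
y = 0.5 + x - 2.0 * x**2 + 0.1 * rng.standard_normal(x.size)

def log_evidence(x, y, order, alpha=1.0, noise=0.1):
    """Log marginal likelihood of a polynomial model of the given order.

    With prior w ~ N(0, alpha*I) and y = Phi w + eps, eps ~ N(0, noise^2 I),
    the evidence is y ~ N(0, alpha*Phi Phi^T + noise^2 I), computed exactly.
    """
    Phi = np.vander(x, order + 1)                      # polynomial features
    C = alpha * Phi @ Phi.T + noise**2 * np.eye(x.size)
    sign, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + quad)

# Too-simple orders underfit; too-complex ones pay an Occam penalty,
# so the evidence typically peaks near the true order (2 here).
scores = [log_evidence(x, y, d) for d in range(6)]
```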
We can do a lot of machine learning without the probabilistic framework, but many aspects of learning depend critically on a careful probabilistic representation of uncertainty and the ability to interpret it: for example, forecasting, decision making, learning from limited/noisy/missing data, data compression, etc.
Parametric probabilistic models can be unrealistically restrictive; Bayesian nonparametrics mitigate this by defining infinite-dimensional probabilistic models using tools from stochastic processes. It's a simple framework for modeling complex data - everything still follows from the sum and product rule, even though the models are rich. We can't do optimization over an infinite number of parameters, which is why Bayesian inference (averaging rather than optimizing) is the natural fit. The key property of these models is that their effective complexity grows as you receive more data. Some examples: Gaussian process regression vs. polynomial regression for function approximation, Gaussian process classifiers instead of logistic regression for classification, Dirichlet process mixtures vs. finite mixture models/k-means for clustering, etc.
Say you want to learn a nonlinear function, with error bars, from some data. Gaussian processes define a distribution over functions, which can be used for Bayesian regression. What makes them feasible is the defining property of a Gaussian process: for any finite subset of input points, the marginal distribution of the function values is a multivariate Gaussian distribution. This allows us to do all of our operations on finite-dimensional vectors. They are applicable whenever you need to estimate a function. A neural network with a single hidden layer, infinitely many hidden units, and Gaussian priors on the weights converges to a Gaussian process (shown by Neal in 1994), which lets us see what class of functions such a network actually captures.
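A minimal sketch of that finite-dimensional property in action: GP regression reduces to linear algebra on the covariance matrix of the observed points. The kernel, training data, and noise level here are illustrative assumptions; note how the predictive variance is small near the data and reverts to the prior far away.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# Training data (illustrative): noisy observations of sin(x).
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(X)
noise = 1e-2

# GP posterior at test inputs: only finite-dimensional Gaussians are involved.
Xs = np.array([0.0, 10.0])                  # one point near the data, one far away
K = rbf(X, X) + noise * np.eye(X.size)      # train covariance plus noise
Ks = rbf(X, Xs)                             # train/test cross-covariance
mean = Ks.T @ np.linalg.solve(K, y)         # posterior mean at Xs
var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)  # posterior variance
```

Far from the data (x = 10), the variance returns to the prior value of 1: the error bars honestly report that the function is unknown there.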
Probabilistic model development and the derivation of inference algorithms are difficult, time-consuming, and error-prone, and usually done by hand. Probabilistic programming offers a beautiful and elegant way of automating the process of deriving inference algorithms. Probabilistic programming languages allow probabilistic models to be expressed as computer programs which generate data. In this format (much more general than a graphical model), you can build universal inference engines for the language which do inference over the hidden states of the program given some observed data. The engine can then reason about the observed data, given the probabilistic program. These engines aren't very efficient yet, but they are getting more efficient and may eventually replace derivation by hand. Within a universal inference engine, many inference algorithms are possible.
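A toy sketch of the idea (not a real probabilistic programming language): the model is an ordinary function that generates a latent state, and a generic importance-sampling routine does inference over that state given observations. The coin model, observations, and sample count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def coin_model():
    """Generative program: draw a latent coin bias uniformly at random."""
    return rng.uniform(0.0, 1.0)

def likelihood(p, data):
    """Probability that flips with bias p would produce the observed data."""
    heads = sum(data)
    return p**heads * (1 - p) ** (len(data) - heads)

def infer(model, lik, data, n_samples=20000):
    """Generic likelihood weighting: works for any (model, likelihood) pair."""
    samples = np.array([model() for _ in range(n_samples)])
    weights = np.array([lik(s, data) for s in samples])
    return np.sum(samples * weights) / np.sum(weights)   # posterior mean

observed = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # 8 heads out of 10 flips
posterior_mean = infer(coin_model, likelihood, observed)
```

Nothing in `infer` knows about coins; that separation between model program and inference engine is the point. (Beta-Bernoulli conjugacy gives the exact posterior mean here, (8+1)/(10+2) = 0.75, as a check.)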
Bayesian optimization attempts to find the global optimum of a black-box function which is expensive to evaluate. Black-box means the only way you can access the function is by providing it with input values and getting values back. Because the functions are expensive to evaluate (e.g. real-world experiments), you want to optimize them in as few evaluations as possible. The key idea is to act rationally under uncertainty: Bayesian optimization treats the problem as sequential decision-making and maintains a probabilistic model of the function, with uncertainty, to decide where to evaluate next. It can also be done with constraints.
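A minimal sketch of one such loop, assuming a GP surrogate with a fixed RBF kernel and a lower-confidence-bound acquisition rule (one of several common choices; the talk doesn't prescribe a particular acquisition function). The target function, kernel lengthscale, and iteration budget are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Stand-in for an expensive black-box function (minimum at x = 0.6)."""
    return (x - 0.6) ** 2

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

# Start from a few random evaluations, then choose new points sequentially.
X = rng.uniform(0, 1, 3)
y = f(X)
grid = np.linspace(0, 1, 200)

for _ in range(10):
    K = rbf(X, X) + 1e-6 * np.eye(X.size)
    Ks = rbf(X, grid)
    mu = Ks.T @ np.linalg.solve(K, y)                     # surrogate mean
    var = np.maximum(1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0), 1e-12)
    # Lower confidence bound: trade off low predicted value vs. uncertainty.
    lcb = mu - 2.0 * np.sqrt(var)
    x_next = grid[np.argmin(lcb)]                         # next evaluation
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))

best = X[np.argmin(y)]
```

The acquisition rule is what makes the decision-making rational: it spends evaluations where the model is either optimistic or uncertain, rather than on a blind grid.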
We often produce more data than we can store or transmit, so we need good methods for representing data in as little space as possible. By Shannon's source coding theorem, all compression algorithms are implicitly based on probabilistic models. Developing better sequential, adaptive, nonparametric models of data lets us predict the data better, making it (on average) cheaper to store or transmit (very explicitly in some cases). For example, hierarchical Bayesian models can be used to learn and predict symbol occurrences in sequences, which produces cutting-edge compression results for human text.
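A minimal illustration of the model-to-code-length link: a model that assigns probability p to the next symbol implies a code length of about -log2 p bits for it, so a better sequential predictor yields a shorter code. The add-one (Laplace) predictor and the toy alphabet/sequence below are illustrative assumptions, far simpler than the hierarchical models in the talk.

```python
import numpy as np

ALPHABET = 4  # assume symbols drawn from a 4-letter alphabet

def code_length_uniform(seq):
    """Bits needed when every symbol is predicted uniformly."""
    return len(seq) * np.log2(ALPHABET)

def code_length_adaptive(seq):
    """Bits under a sequential add-one (Laplace) predictor.

    Each symbol costs -log2 p(symbol | counts so far); the model adapts as
    it reads the data, so predictable sequences become cheap to encode.
    """
    counts = {}
    total_bits = 0.0
    for i, s in enumerate(seq):
        p = (counts.get(s, 0) + 1) / (i + ALPHABET)
        total_bits += -np.log2(p)
        counts[s] = counts.get(s, 0) + 1
    return total_bits

seq = "ab" * 20  # a highly predictable sequence of length 40
```

The adaptive model pays a little up front while its counts are flat, then predicts well; its total code length beats the fixed uniform code on any sequence with structure.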
Data is ubiquitous, and there's a lot of value in understanding it, but there isn't enough human power to analyze all of it. It would therefore be useful to automate some aspects of data analysis by developing a system for automating model discovery from data. To achieve this, we need an open-ended language of models, a search procedure, a way to evaluate models, and a way to explain the resulting models. For example, the language could consist of kernels, which can be combined to create more complex kernels; search proceeds by comparing marginal likelihoods as kernels are combined; and an executive summary of the results is produced from the structure of the chosen model. This also yields models with state-of-the-art predictive performance. We want the automatic statistician to be self-critical, asking whether the data actually matches the assumptions of the model.
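A one-step sketch of that search: score candidate kernels by the GP log marginal likelihood and keep the best. The kernels, data, and fixed hyperparameters are illustrative assumptions; a real system would also search over kernel compositions and optimize hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, ls=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def linear_kernel(a, b):
    return a[:, None] * b[None, :]

def log_marginal_likelihood(kernel, x, y, noise=0.1):
    """GP evidence: log N(y | 0, K + noise^2 I)."""
    K = kernel(x, x) + noise**2 * np.eye(x.size)
    sign, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (quad + logdet + x.size * np.log(2 * np.pi))

# Oscillatory data: a smooth nonlinear kernel should beat a linear one.
x = np.linspace(0, 3, 30)
y = np.sin(4 * x) + 0.05 * rng.standard_normal(x.size)

candidates = {"RBF": rbf_kernel, "Linear": linear_kernel}
scores = {name: log_marginal_likelihood(k, x, y) for name, k in candidates.items()}
best = max(scores, key=scores.get)   # the model the system would report
```

The winning kernel's structure is also what drives the explanation step: "the function varies smoothly on a short lengthscale" is read off the chosen model, not bolted on afterwards.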
Many problems in machine learning and AI require evaluating a large number of models on potentially large datasets, which can be very expensive. We need a way to consider the tradeoff between statistical and computational efficiency. One approach is to predict how well a model would perform if it were run for longer. This, combined with the automatic statistician, would allow us to automatically design machine learning methods that learn from data in as little time as possible.