|Context||NIPS 2015 Oral|
In the standard learning scenario, the learner receives a label sample which is assumed to be drawn iid from some unknown distribution, and then predict the labels of unseen samples drawn iid from the same distribution. In time series prediction, the learner receives a realization of a stochastic process and must predict the next value. The values received may not be iid. A standard assumption is that the underlying stochastic process is stationary (does not change with time). It is also assumed that the dependency of a current sample to a sample in the past decays for samples further into the past (mixing). Both of these assumptions consider only the distribution of the underlying stochastic process and ignore other key components of the learning problem such as the loss function and hypothesis test, and there is no way to test them and they may be invalid. For example, if you flip a coin at the beginning of time, and if heads the stochastic process is a sine and talks indicates a cosine, the resulting process is non-stationary and is not mixing, so existing theory states that learning is impossible. Many real world processes of interest are either non-stationary or non-mixing; most processes with a trend or a periodic component are not stationary. A natural question is whether learning is still possible in these cases.
Given a sample, the hypothesis set of models, and a loss function, the goal is to seek a hypothesis which has the smallest expected lost in the near future conditioned on the history we observed (path-dependent generalization error). To avoid assumptions on stationarity and mixing, there are two ingredients required for the analysis: Expected sequential covering numbers (natural extensions of classical learning theory notions to sequential data), and discrepancy measure (the largest possible discrepancy between the target and sample distributions in the near future). This provides a natural measure of non-stationarity, because it will be zero if the process is stationary or if the process is weakly stationary. Under some additional mild assumptions, it can be estimated from data.
It can be shown that with high probability path-dependent generalization error is bounded by the empirical error, the discrepancy, and a term which captures the complexity of the hypothesis set being used. Since we may not have stationarity in the data, it may make more sense to emphasize certain sample points more than others. We can therefore use a probability distribution over sample points. Unfortunately, in general estimating discrepancy in general is hopeless because it depends on the distribution of the future we are trying to fix. But, it can be upper bounded by the discrepancy of the recent history and the whole sample, and the recent history and the future we are trying to predict. The former can be estimated from the data and the latter can be assumed to be small. The resulting guarantees provides a natural objective, which includes an L2 loss, a discrepancy term, and regularization terms controlling the complexity of the hypothesis and the deviation of the sample-weighting distribution from the uniform distribution. Using duality, this objective can be formulated in a convex way.