[1912.08866] Continuous Meta-Learning without Tasks

Meta-learning is a promising strategy for learning to efficiently learn within new tasks, using data gathered from a distribution of tasks. However, the meta-learning literature thus far has focused on the task-segmented setting, where at train time offline data is assumed to be split according to the underlying task, and at test time the algorithms are optimized to learn within a single task. In this work, we enable the application of generic meta-learning algorithms to settings where this task segmentation is unavailable, such as continual online learning with a time-varying task. We present meta-learning via online changepoint analysis (MOCA), an approach which augments a meta-learning algorithm with a differentiable Bayesian changepoint detection scheme. The framework allows both training and testing directly on time series data without segmenting it into discrete tasks. We demonstrate the utility of this approach on a nonlinear meta-regression benchmark as well as two meta-image-classification benchmarks.
Figure 1: An illustration of a simplified version of our problem setting and of the MOCA algorithm. We observe a time series of data in which an input x is presented (in this case, an image), based on which a probabilistic prediction (which we denote ŷ) is made, after which the true label y is received (in this case, class labels taking value 1 or 2). Unobserved changes in the underlying task (here labeled “changepoint”) result in changes to the generative model of x and/or y. In the image above, the images corresponding to label 1 switch from sailboats to school buses, while the images corresponding to label 2 switch from sloths to geese. MOCA recursively estimates the time since the last changepoint and conditions an underlying meta-learning model only on data that is relevant to the current task.

Figure 2: The performance of MOCA on the sinusoid regression problem. Right: the belief over run length versus time. The intensity of each point in the plot corresponds to the belief in that run length at the associated time; the red lines show the true changepoints. Left: visualizations of the posterior predictive density corresponding to the blue dotted lines in the figure on the right. The red line denotes the current function (task), and red points denote samples from that function. Green points denote data from previous tasks, with fainter points being older. (a) The posterior at an arbitrary time. (b) A case in which MOCA did not detect the changepoint; here, the pre- and post-change functions (corresponding to panels a and b) are highly similar. (c) An instance of a multimodal posterior. (d) The changepoint is initially missed because the data generated from the post-change function is highly likely under the previous posterior. (e) After an unlikely data point, the model increases its uncertainty as the changepoint is detected.
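The run-length belief shown in Figure 2 follows the Bayesian online changepoint detection recursion (Adams and MacKay style) that MOCA builds on: at each step, each run-length hypothesis either grows by one (no changepoint) or resets to zero (changepoint), weighted by the predictive likelihood of the new datum under each hypothesis. A minimal sketch of that recursion, with variable names and the `hazard` parameterization chosen here for illustration:

```python
import numpy as np

def bocpd_step(log_belief, log_pred_lik, hazard):
    """One recursive update of the run-length belief.

    log_belief:   log p(r_{t-1} | data so far), one entry per run length, shape (t,)
    log_pred_lik: log p(x_t | r_{t-1}, data), predictive likelihood of the new
                  datum under each run-length hypothesis, shape (t,)
    hazard:       prior probability of a changepoint at each step
    """
    # Growth: no changepoint occurred, so each run length increments by one.
    log_growth = log_belief + log_pred_lik + np.log(1.0 - hazard)
    # Changepoint: run length resets to zero; mass sums over all prior lengths.
    log_cp = np.logaddexp.reduce(log_belief + log_pred_lik) + np.log(hazard)
    log_new = np.concatenate([[log_cp], log_growth])
    # Normalize so the belief remains a valid distribution.
    return log_new - np.logaddexp.reduce(log_new)
```

Note that the number of hypotheses grows by one per step, which is the source of the linear per-iteration cost reported in the timing experiments (Figure 7).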

Figure 3: Performance of MOCA versus baselines on sinusoid regression (left; lower is better), Rainbow MNIST (center; higher is better), and miniImageNet (right; higher is better), as a function of hazard rate. For all problems, MOCA outperforms the baselines, and its performance degrades only slightly from that of the oracle. In contrast, sliding-window methods result in severely degraded performance.

Figure 4: Performance change from augmenting a model trained with MOCA with task supervision at test time (violet), and from using changepoint estimation at test time for a model trained with task supervision (teal), for sinusoid regression (left), Rainbow MNIST (center), and miniImageNet (right).

Figure 5: Performance versus training horizon T for the sinusoid problem with hazard 0.01. The lowest hazard rate was used to amplify the effects of a short training horizon. A minor decrease in performance is visible for very short training horizons (around 20), but performance flattens off for horizons of around 100 and above. We expect these diminishing marginal returns to occur for all systems and hazard rates.

Figure 6: Test negative log likelihood of MOCA on the sinusoid problem with partial task segmentation. Partial segmentation during training yields a negligible performance increase, while partial supervision at test time uniformly improves performance. Note that each column corresponds to one trained model, so the randomly varying performance across training supervision rates may be explained simply by minor differences between individual models.

Figure 7: Time per iteration versus iteration number at test time. The right-hand side of the curve shows the linear per-iteration complexity expected of MOCA. No hypothesis pruning was performed in these experiments; with pruning, test-time computation could be constant rather than linear. The figure shows 95% confidence intervals over 10 trials, but the computation time is repeatable enough that the intervals are not visible.
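The pruning mentioned above can be realized by keeping only the most probable run-length hypotheses at each step, capping the per-iteration cost at a constant. The paper's timing experiments used no pruning, so the helper below is purely a hypothetical sketch of one simple truncation scheme (top-k by posterior mass, renormalized):

```python
import numpy as np

def prune_run_lengths(log_belief, max_hypotheses):
    """Truncate the run-length belief to its top-k hypotheses (hypothetical
    helper; the paper's timing experiments performed no pruning).

    Returns the renormalized log belief and the kept run-length indices.
    """
    if log_belief.size <= max_hypotheses:
        return log_belief, np.arange(log_belief.size)
    # Indices of the k most probable hypotheses, kept in run-length order.
    keep = np.sort(np.argsort(log_belief)[-max_hypotheses:])
    pruned = log_belief[keep]
    # Renormalize so the truncated belief is again a distribution.
    return pruned - np.logaddexp.reduce(pruned), keep
```

Other truncation rules (e.g. dropping hypotheses below a probability threshold) would serve the same purpose; top-k gives a hard constant bound on per-step cost.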