Reawakening knowledge:
Anticipatory recovery from catastrophic interference via structured training

New York University, University of Colorado Boulder, Google DeepMind

Abstract

We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Networks typically suffer from catastrophic interference when trained on a sequence of documents; however, we discover a curious and remarkable property of LLMs fine-tuned sequentially in this setting: they exhibit anticipatory behavior, recovering from forgetting on documents before encountering them again. This behavior occurs even though the documents are never presented in context together, and it emerges and becomes more robust as the number of parameters in the architecture grows. Through comprehensive experiments and visualizations, we demonstrate a new mechanism by which over-parameterized neural networks can recover from catastrophic interference and uncover new insights into training over-parameterized networks in cyclically structured environments.

The Anticipatory Recovery Phenomenon

We uncover an intriguing behavior when fine-tuning an LLM on N documents for E epochs with cyclic training, taking multiple gradient steps on each document in each pass: starting from epoch 2, the loss on the first task stops increasing halfway through the cycle and starts to recover. In later epochs, more than 90% of the initial forgetting has been recovered before we cycle back to the first task. We call this surprising effect “anticipatory recovery.”

Figure: Cross-entropy loss curves on the first document for cyclic and randomly shuffled fine-tuning. The black circles indicate points just prior to training on the focal document.

Motivation

Most work in continual learning has focused on a few limited and artificial settings, such as task-incremental or class-incremental learning. In these paradigms, the tasks are often completely disjoint from each other, and old tasks never reappear. This is very different from the naturalistic data sequences that occur in the real world, which have repetition and temporal structure.


In this paper, we study the simplest special case of sequential learning with temporal structure: cyclic training. In cyclic training, the tasks are iterated in exactly the same order across epochs. In our experiments, each task consists of fine-tuning a large language model on a different document; in particular, we take a few gradient steps on each document before moving on to the next one.
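To make the setup concrete, here is a minimal sketch of the cyclic fine-tuning loop in PyTorch. The model is assumed to be a Hugging Face-style causal LM whose forward pass returns a `.loss` when the batch includes labels, and the hyperparameters are illustrative placeholders rather than the paper's exact configuration.

```python
import torch

def cyclic_finetune(model, documents, num_epochs=5, steps_per_doc=10, lr=1e-5):
    """Fine-tune `model` on `documents` in the same fixed order every epoch.

    `documents` is a list of pre-tokenized batches (dicts that include labels)
    so that `model(**batch).loss` yields a language-modeling loss.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_on_doc0 = []  # track forgetting/recovery of the first document

    for epoch in range(num_epochs):
        for doc in documents:                 # fixed, repeated order (cyclic)
            # Record the loss on document 0 before each task switch.
            with torch.no_grad():
                loss_on_doc0.append(model(**documents[0]).loss.item())
            for _ in range(steps_per_doc):    # a few gradient steps per document
                loss = model(**doc).loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return loss_on_doc0
```

Plotting `loss_on_doc0` over the course of training should produce curves qualitatively like the one above: the loss rises after the first document is left behind, then bends back down before the cycle returns to it.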


Understanding Anticipatory Recovery

We conducted a comprehensive analysis of how different training factors affect anticipatory recovery. We found that anticipatory recovery occurs only when the network has sufficient width and depth to fit each document well.


In addition, longer task sequences and more gradient steps on each task increase the amount of recovery.

Figure: Effect of model size for pre-trained models on anticipatory recovery. “Recovery Score” refers to the average proportion of the initial forgetting during the last epoch that the model recovers before returning to the same document.
Figure: Effect of model size for models trained from scratch on anticipatory recovery.
Figure: Effect of number of documents (sequence length) for models trained from scratch on anticipatory recovery.
Figure: Effect of number of gradient steps for models trained from scratch on anticipatory recovery.
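To make the “Recovery Score” concrete, one natural formalization (our notation; the paper's exact definition may differ in detail) is

$$
\text{Recovery Score} \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{\ell^{(i)}_{\max} - \ell^{(i)}_{\text{pre}}}{\ell^{(i)}_{\max} - \ell^{(i)}_{\text{post}}},
$$

where, for document $i$ in the last epoch, $\ell^{(i)}_{\text{post}}$ is its loss immediately after it is trained on, $\ell^{(i)}_{\max}$ is the peak loss it reaches while the other documents are being trained, and $\ell^{(i)}_{\text{pre}}$ is its loss just before it is revisited. A score of 1 means the forgetting is fully undone before the document reappears; 0 means no recovery.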

Visualizations

We made some initial progress towards understanding the underlying mechanisms that cause the anticipatory recovery phenomenon. We visualized how the model weights and activations change throughout cyclic training, and found that the trajectory forms a conic spiral in a low-dimensional manifold and that the solutions to adjacent tasks become closer.

Figure: Top 3 PCA components of the last-layer weights over the first 3 epochs.
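The visualization itself is straightforward to reproduce: save a flattened snapshot of the layer's weights after each task, stack the snapshots, and project them onto their leading principal components. A minimal sketch (the checkpointing scheme and layer choice here are assumptions, not the paper's exact code):

```python
import numpy as np
from sklearn.decomposition import PCA

def weight_trajectory_pca(checkpoints, n_components=3):
    """Project a sequence of weight snapshots onto their top PCA components.

    `checkpoints` is a list of 1-D numpy arrays, e.g. the flattened last-layer
    weights saved after training on each document during cyclic fine-tuning.
    Returns an array of shape (num_checkpoints, n_components) that can be
    plotted as a 3-D trajectory.
    """
    X = np.stack(checkpoints)   # (num_checkpoints, num_params)
    X = X - X.mean(axis=0)      # center before PCA
    return PCA(n_components=n_components).fit_transform(X)
```

Plotting the three returned coordinates against the checkpoint index reveals the spiral structure described above.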

Prequential Evaluation

Prequential evaluation refers to measuring the online loss, i.e., the loss on the upcoming task before the model is trained on it, which is what matters most for real-world agents.


As a result of anticipatory recovery, we show that training with a fixed ordering outperforms random shuffling in the prequential evaluation setting. This result hints at the practical benefits of structured training.
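A minimal sketch of this comparison (again with placeholder hyperparameters and a Hugging Face-style model that returns `.loss`): evaluate each document just before training on it, then average the online losses, with and without re-shuffling the order every epoch.

```python
import random
import torch

def prequential_loss(model, documents, num_epochs=5, steps_per_doc=10,
                     lr=1e-5, shuffle_each_epoch=False):
    """Average loss on each upcoming document, measured *before* training on it."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    online_losses = []

    for epoch in range(num_epochs):
        order = list(range(len(documents)))
        if shuffle_each_epoch:              # random-shuffle baseline
            random.shuffle(order)           # otherwise: fixed, cyclic order
        for i in order:
            with torch.no_grad():           # prequential: evaluate first ...
                online_losses.append(model(**documents[i]).loss.item())
            for _ in range(steps_per_doc):  # ... then train on the document
                loss = model(**documents[i]).loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return sum(online_losses) / len(online_losses)
```

With anticipatory recovery at work, the fixed ordering (`shuffle_each_epoch=False`) should achieve the lower average online loss.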


Toy Computation Model

We devise a toy computational model that exhibits a similar anticipatory recovery phenomenon in its loss curve, built from a single learnable linear embedding layer and a learnable target vector combined with task-specific mappings. Please refer to the paper for more details.
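As a starting point, the ingredients can be wired together roughly as follows. This is our own illustrative variant, with fixed random projections standing in for the task-specific mappings; it is not the paper's exact construction, which is what actually reproduces the recovery curve.

```python
import torch

dim, num_tasks = 32, 16
torch.manual_seed(0)

# Ingredients: a learnable linear embedding, a learnable target vector,
# and a fixed task-specific mapping per task (here: random projections).
embed = torch.nn.Linear(dim, dim, bias=False)            # learnable embedding layer
target = torch.nn.Parameter(torch.randn(dim))            # learnable target vector
task_maps = [torch.randn(dim, dim) / dim**0.5 for _ in range(num_tasks)]  # fixed
task_inputs = [torch.randn(dim) for _ in range(num_tasks)]                # fixed

def task_loss(k):
    # Mismatch between the task-mapped embedding of the task input and the target.
    return ((task_maps[k] @ embed(task_inputs[k]) - target) ** 2).mean()

optimizer = torch.optim.SGD(list(embed.parameters()) + [target], lr=0.05)
for epoch in range(5):
    for k in range(num_tasks):            # cyclic order, a few steps per task
        for _ in range(10):
            loss = task_loss(k)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```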


Conclusion

We demonstrated the anticipatory recovery phenomenon: networks recover from the initial forgetting before seeing the same document again. This stands in sharp contrast with the well-known phenomenon of catastrophic interference, where forgetting increases monotonically as a network is trained on a sequence of different documents. Our research indicates that there is value in exploring naturalistic task sequences within continual learning.