Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos

New York University
CoLLAs 2025 (Oral)

Abstract

Self-supervised learning holds the promise of learning good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose "Memory Storyboard," a novel continual self-supervised learning framework that groups recent past frames into temporal segments for a more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, where the storyboard temporal segments are produced and then transferred to a long-term memory. Experiments on two real-world egocentric video datasets show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations that outperform those produced by state-of-the-art unsupervised continual learning methods.

The Memory Storyboard Framework

Our proposed Memory Storyboard framework for streaming SSL from egocentric videos. Similar frames are clustered into temporal segments, and their labels (text shown for illustration purposes only) are updated in the long-term memory buffer for replay. SSL involves contrastive learning at both the frame and temporal segment levels.

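A minimal PyTorch-style sketch of how the two objectives could be combined, assuming each frame in the replay batch carries the temporal segment label produced by the storyboard: a SimCLR-style loss over two augmented views at the frame level, plus a supervised-contrastive-style loss that treats frames from the same segment as positives. The function names, temperature, and equal weighting of the two terms are illustrative assumptions, not the exact implementation.

    import torch
    import torch.nn.functional as F

    def frame_level_nt_xent(z1, z2, tau=0.1):
        # SimCLR-style loss: the two augmented views of the same frame are positives.
        n = z1.shape[0]
        z = F.normalize(torch.cat([z1, z2]), dim=1)            # (2N, D)
        sim = z @ z.t() / tau                                   # (2N, 2N)
        self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(self_mask, float("-inf"))         # exclude self-similarity
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
        return F.cross_entropy(sim, targets)

    def segment_level_supcon(z, seg_labels, tau=0.1):
        # Supervised-contrastive-style loss: frames sharing a temporal segment
        # label are treated as positives for each other.
        z = F.normalize(z, dim=1)
        sim = z @ z.t() / tau
        logits = sim - sim.max(dim=1, keepdim=True).values.detach()
        self_mask = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
        pos_mask = (seg_labels[:, None] == seg_labels[None, :]) & ~self_mask
        denom = torch.logsumexp(logits.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
        log_prob = logits - denom
        pos_count = pos_mask.sum(dim=1).clamp(min=1)
        return -(log_prob * pos_mask).sum(dim=1).div(pos_count).mean()

    def storyboard_loss(z1, z2, seg_labels):
        # Combine the frame-level and segment-level objectives (equal weights assumed).
        return frame_level_nt_xent(z1, z2) + segment_level_supcon(
            torch.cat([z1, z2]), seg_labels.repeat(2))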

Two-tier Memory Structure

Long-term memory is updated with reservoir sampling, and short-term memory with first-in-first-out (FIFO). Temporal segmentation is applied to the short-term memory, which then updates the labels of the corresponding images in the long-term memory.
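
A minimal sketch of this two-tier buffer, assuming the short-term memory is a plain FIFO queue, the long-term memory uses standard reservoir sampling, and segment labels computed on the short-term memory are propagated to matching frames in the long-term memory by frame id; the class and method names are illustrative.

    import random
    from collections import deque

    class TwoTierMemory:
        """Illustrative sketch: short-term FIFO buffer + long-term reservoir buffer."""

        def __init__(self, short_capacity, long_capacity):
            self.short = deque(maxlen=short_capacity)   # (frame_id, frame), oldest evicted first
            self.long = []                              # [frame_id, frame, segment_label]
            self.long_capacity = long_capacity
            self.num_seen = 0                           # stream length so far, for reservoir sampling

        def add(self, frame_id, frame):
            self.short.append((frame_id, frame))        # FIFO update of short-term memory
            self.num_seen += 1
            if len(self.long) < self.long_capacity:     # reservoir sampling keeps a uniform
                self.long.append([frame_id, frame, None])   # sample of the whole stream
            else:
                j = random.randrange(self.num_seen)
                if j < self.long_capacity:
                    self.long[j] = [frame_id, frame, None]

        def update_segment_labels(self, labels_by_frame_id):
            # Propagate temporal segment labels produced on the short-term memory
            # to the corresponding frames stored in the long-term memory.
            for entry in self.long:
                if entry[0] in labels_by_frame_id:
                    entry[2] = labels_by_frame_id[entry[0]]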


Temporal Segmentation Algorithm

The objective of our segmentation algorithm is to maximize the average within-class similarity of frames, which we optimize with a greedy algorithm.
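
One simple way to realize this objective is sketched below, assuming frames are represented by feature vectors, similarity means cosine similarity, and boundaries are inserted greedily wherever they most increase the average within-segment pairwise similarity until a fixed target number of segments is reached; this is an illustrative, non-streaming rendition of the stated objective rather than the exact algorithm used in the paper.

    import numpy as np

    def within_segment_similarity(sim, boundaries, n):
        # Mean pairwise similarity over all frame pairs that fall in the same segment.
        # `boundaries` are segment start indices and always include 0.
        edges = sorted(boundaries) + [n]
        total, pairs = 0.0, 0
        for s, e in zip(edges[:-1], edges[1:]):
            m = e - s
            if m > 1:
                block = sim[s:e, s:e]
                total += (block.sum() - np.trace(block)) / 2.0
                pairs += m * (m - 1) // 2
        return total / max(pairs, 1)

    def greedy_segmentation(features, num_segments):
        # Greedily insert the boundary that most increases the objective.
        z = features / np.linalg.norm(features, axis=1, keepdims=True)
        sim = z @ z.T                                    # cosine similarity matrix
        n = len(features)
        boundaries = {0}
        while len(boundaries) < num_segments:
            best_b, best_score = None, -np.inf
            for b in range(1, n):
                if b in boundaries:
                    continue
                score = within_segment_similarity(sim, list(boundaries) + [b], n)
                if score > best_score:
                    best_b, best_score = b, score
            boundaries.add(best_b)
        # Convert segment boundaries into per-frame segment labels.
        edges = sorted(boundaries) + [n]
        labels = np.zeros(n, dtype=int)
        for k, (s, e) in enumerate(zip(edges[:-1], edges[1:])):
            labels[s:e] = k
        return labels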

Visualization of the temporal segments produced by Memory Storyboard on (a) SAYCam and (b)(c) KrishnaCam at the end of training. Frames are sampled every 10 seconds. Each color bar corresponds to a temporal class (the first and last classes might be incomplete).

Results

We demonstrate that Memory Storyboard achieves state-of-the-art performance on downstream ImageNet and iNaturalist classification tasks when trained on real-world egocentric video datasets. Among all the streaming self-supervised learning methods we evaluated, Memory Storyboard is the only one that is competitive with or even outperforms IID training when trained on these datasets.

Results on streaming SSL from SAYCam: downstream evaluation on object classification (accuracy %) for SSL models trained under the streaming setting. For "No Replay" and "IID," the results are the same across memory buffer sizes. The "IID" methods do not follow the streaming setting and are shown for reference only, as a performance "upper bound" with the same number of gradient updates. Unless otherwise specified, standard reservoir sampling is used in the replay buffer.

Batch Composition Under Different Memory Constraints

We study the effects of training factors including label merging, subsampling rate, average segment length, memory buffer size, and training batch composition. These studies provide insight into more efficient streaming learning from videos. In particular, we explore the optimal composition of the training batch from short-term vs. long-term memory under different memory constraints. Drawing more of the batch from long-term memory improves performance when we can afford a large memory bank, while drawing less from it helps prevent overfitting when the memory bank is small.
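
A minimal sketch of how such a composition ratio could be applied when assembling each training batch; the parameter names and the uniform sampling from each buffer are illustrative assumptions.

    import random

    def compose_batch(short_term, long_term, batch_size, short_fraction=0.25):
        # Draw a fixed fraction of the batch from short-term (recent) memory and
        # the remainder from long-term (reservoir-sampled) memory.
        n_short = int(round(batch_size * short_fraction))
        n_long = batch_size - n_short
        short_samples = random.sample(list(short_term), min(n_short, len(short_term)))
        long_samples = random.sample(list(long_term), min(n_long, len(long_term)))
        return short_samples + long_samples

Under this sketch, a small long-term memory would call for a larger short_fraction (drawing less of the batch from the long-term buffer), matching the overfitting behavior described above.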

Memory Storyboard performance on SAYCam with different long-term memory sizes (5k, 10k, 50k, and 100k) and varying training batch compositions (12.5% to 75.0% from short-term memory) using SVM readout. Each colored line represents the performance of different training batch compositions when the model has seen the same amount of data from the stream. Each black line represents the performance of different training batch compositions when the model has taken the same number of gradient updates.

Conclusion

The ability to continuously learn from large-scale uncurated streaming video data is crucial for applying self-supervised learning methods in real-world embodied agents. Existing works have explored this problem only to a limited extent, have mainly focused on static datasets, and do not perform well in the streaming video setting. Inspired by the event segmentation mechanism in human cognition, we propose Memory Storyboard, which leverages temporal segmentation to produce a two-tier memory hierarchy akin to the short-term and long-term memory of humans. Memory Storyboard combines a temporal contrastive objective with a standard self-supervised contrastive objective to facilitate representation learning from scratch through streaming video experiences. Memory Storyboard achieves state-of-the-art performance on downstream classification and object detection tasks when trained on large real-world egocentric video datasets. By studying the effects of subsampling rates, average segment length, normalization, and optimal batch composition under different compute and memory constraints, we also offer valuable insights into the design choices for streaming self-supervised learning.