Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos

New York University

Abstract

Self-supervised learning holds the promise of learning good representations from real-world continuous uncurated data streams. However, most existing work in visual self-supervised learning focuses on static images or artificial data streams. Toward a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose “Memory Storyboard”, which groups recent past frames into temporal segments to more effectively summarize past visual streams for memory replay. To support efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets, including SAYCam and KrishnaCam, show that contrastive learning objectives on top of storyboard frames yield semantically meaningful representations that outperform those produced by state-of-the-art unsupervised continual learning methods.

The Memory Storyboard Framework

Overview of our proposed Memory Storyboard framework for streaming SSL from egocentric videos. Similar frames are clustered into temporal segments, and their labels (text shown for illustration purposes only) are updated in the long-term memory buffer for replay. SSL involves contrastive learning at both the frame and temporal segment levels.

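As a concrete illustration of this two-level objective, the sketch below combines a standard frame-level InfoNCE loss with a supervised-contrastive loss that treats frames sharing a temporal segment label as positives. The function names, temperature value, and equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def frame_infonce(z1, z2, temperature=0.1):
    # Standard SimCLR-style InfoNCE between two augmented views of each frame.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def segment_supcon(z, labels, temperature=0.1):
    # Supervised-contrastive loss with temporal segment ids as pseudo-labels:
    # frames belonging to the same segment are treated as positives.
    z = F.normalize(z, dim=1)
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = (z @ z.T / temperature).masked_fill(eye, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_counts = pos.sum(1)
    # Mean log-likelihood of positives, for anchors that have at least one.
    anchor_loss = -torch.where(pos, log_prob, torch.zeros_like(log_prob)).sum(1)
    valid = pos_counts > 0
    return (anchor_loss[valid] / pos_counts[valid]).mean()

# Combined objective (equal weighting is an assumption):
# loss = frame_infonce(z1, z2) + segment_supcon(torch.cat([z1, z2]), labels.repeat(2))
```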

Two-tier Memory Structure

Long-term memory is updated with reservoir sampling, and short-term memory with first-in-first-out (FIFO) replacement. Temporal segmentation is applied to the short-term memory, which then updates the labels of the corresponding images in the long-term memory.
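A minimal sketch of this two-tier buffer, assuming a simple list-based implementation and illustrative capacities (the paper's actual buffer sizes and data layout may differ):

```python
import random

class TwoTierMemory:
    # A FIFO short-term memory feeding a reservoir-sampled long-term memory.
    # Capacities and the list-of-lists layout are illustrative assumptions.

    def __init__(self, short_capacity=1_000, long_capacity=10_000):
        self.short = []       # FIFO list of [frame_idx, frame, segment_id]
        self.long = []        # reservoir sample over the whole stream
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity
        self.num_seen = 0     # total frames observed; drives reservoir sampling

    def add_frame(self, frame):
        idx = self.num_seen
        self.num_seen += 1

        # Short-term memory: first-in-first-out over the recent past.
        self.short.append([idx, frame, None])
        if len(self.short) > self.short_capacity:
            self.short.pop(0)

        # Long-term memory: reservoir sampling keeps each frame seen so far
        # in the buffer with equal probability long_capacity / num_seen.
        if len(self.long) < self.long_capacity:
            self.long.append([idx, frame, None])
        else:
            j = random.randrange(self.num_seen)
            if j < self.long_capacity:
                self.long[j] = [idx, frame, None]

    def propagate_labels(self, segment_ids):
        # segment_ids maps frame_idx -> temporal segment id, as produced by
        # segmenting the short-term memory; labels of the corresponding
        # frames in both buffers are updated in place.
        for buf in (self.short, self.long):
            for entry in buf:
                if entry[0] in segment_ids:
                    entry[2] = segment_ids[entry[0]]
```

Reservoir sampling ensures that every frame observed so far has equal probability of residing in the long-term memory, so the replay buffer remains an unbiased summary of the whole stream.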


Temporal Segmentation Algorithm

The optimization objective of our segmentation algorithm is to maximize the average within-class similarity of frames; we optimize it with a greedy algorithm.
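Below is a hedged sketch of one greedy instantiation of this objective, operating on L2-normalized frame embeddings: boundaries are inserted one at a time, each time picking the cut that most increases the average within-segment similarity. The scoring details and boundary-insertion strategy are assumptions, not necessarily the paper's exact algorithm.

```python
import numpy as np

def avg_within_segment_sim(embs, boundaries):
    # Mean pairwise cosine similarity within each segment, averaged over
    # segments (embs are assumed L2-normalized, shape [n_frames, dim]).
    scores = []
    edges = boundaries + [len(embs)]
    for s, e in zip(edges[:-1], edges[1:]):
        seg = embs[s:e]
        n = len(seg)
        if n > 1:
            sims = seg @ seg.T
            scores.append((sims.sum() - n) / (n * (n - 1)))  # off-diagonal mean
    return float(np.mean(scores)) if scores else 0.0

def greedy_segment(embs, num_segments):
    # Greedily insert segment boundaries one at a time, each time choosing
    # the cut that most increases average within-segment similarity.
    boundaries = [0]
    for _ in range(num_segments - 1):
        best_score, best_b = -np.inf, None
        for b in range(1, len(embs)):
            if b in boundaries:
                continue
            score = avg_within_segment_sim(embs, sorted(boundaries + [b]))
            if score > best_score:
                best_score, best_b = score, b
        if best_b is None:
            break
        boundaries = sorted(boundaries + [best_b])
    return boundaries  # segment k spans frames boundaries[k] .. boundaries[k+1]-1
```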

Visualization of the temporal segments produced by Memory Storyboard on (a) SAYCam and (b), (c) KrishnaCam at the end of training. Images are sampled at one frame every 10 seconds. Each color bar corresponds to a temporal class (the first and last classes may be incomplete).

Conclusion

The ability to continuously learn from large-scale uncurated streaming video data is crucial for applying self-supervised learning methods to real-world embodied agents. Existing work has explored this problem only to a limited extent, has mainly focused on static datasets, and does not perform well in the streaming video setting. Inspired by the event segmentation mechanism in human cognition, we propose Memory Storyboard, which leverages temporal segmentation to produce a two-tier memory hierarchy akin to the short-term and long-term memory of humans. Memory Storyboard combines a temporal contrastive objective with a standard self-supervised contrastive objective to facilitate representation learning from scratch through streaming video experiences. It achieves state-of-the-art performance on downstream classification and object detection tasks when trained on large real-world egocentric video datasets. By studying the effects of subsampling rates, average segment length, normalization, and optimal batch composition under different compute and memory constraints, we also offer practical insights into the design choices for streaming self-supervised learning.