Visual learning has made significant progress with single-subject, iconic images, but learning useful representations from long-form egocentric videos remains challenging. These videos provide a naturalistic, embodied sensory experience, offering spatiotemporal grounding in the real world. Leveraging multimodality, object motion, temporal structure, and consistency can improve performance and data efficiency, yet learning is also hindered by continual distribution shifts. To address these challenges, we have developed methods for incremental recognition in open-world environments, unsupervised continual representation learning, and video representation learning. Our vision is to build efficient and adaptable algorithms for on-device visual learning from streaming embodied experiences, producing representations that are broadly useful for downstream tasks such as planning and visual assistance.
Memory Storyboard groups recent past frames into temporal segments and provides effective summarization of the past visual streams for memory replay.
Published: 2025-01-21
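To make the idea concrete, here is a minimal sketch of temporal-segment summarization for replay: incoming frames are grouped into a new segment whenever their visual features change substantially, one representative frame per segment is kept, and replay batches are sampled from those summaries. The class name, change-point criterion, and buffer sizes are illustrative assumptions, not the published Memory Storyboard implementation.

```python
# Hypothetical sketch of temporal-segment summarization for memory replay.
import torch


class StoryboardBuffer:
    def __init__(self, seg_threshold=0.5, max_segments=256):
        self.seg_threshold = seg_threshold   # feature-distance cut for starting a new segment
        self.max_segments = max_segments     # cap on stored segment summaries
        self.summaries = []                  # one representative frame per segment
        self._last_feat = None

    def add(self, frame, feat):
        """Assign an incoming frame to the current segment or start a new one."""
        if self._last_feat is None or torch.dist(feat, self._last_feat) > self.seg_threshold:
            # Visual change detected: summarize the stream with this frame.
            self.summaries.append(frame)
            if len(self.summaries) > self.max_segments:
                self.summaries.pop(0)        # drop the oldest segment summary
        self._last_feat = feat

    def replay(self, k):
        """Sample k stored segment summaries for memory replay."""
        idx = torch.randperm(len(self.summaries))[:k]
        return torch.stack([self.summaries[int(i)] for i in idx])


# Toy stream of 3x32x32 frames with random stand-in features.
buf = StoryboardBuffer()
for _ in range(100):
    frame, feat = torch.rand(3, 32, 32), torch.rand(16)
    buf.add(frame, feat)
batch = buf.replay(k=8)   # mix these with current frames in the training batch
print(batch.shape)
```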
We propose PooDLe, a self-supervised learning method that combines an invariance-based objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping.
Published: 2024-08-20
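A minimal sketch of the two objectives, assuming a dense feature map per view and a pixel-space optical flow field: the pooled term pulls globally pooled features of the two views together (invariance), while the dense term warps one view's feature map with the flow and matches it to the other's (equivariance). The encoder, loss weighting, and warping details are illustrative assumptions rather than the exact PooDLe formulation.

```python
# Pooled-invariance + dense flow-equivariance objectives on toy feature maps.
import torch
import torch.nn.functional as F


def pooled_invariance_loss(f1, f2):
    """Invariance: globally pooled features of two views should agree."""
    z1 = F.normalize(f1.mean(dim=(2, 3)), dim=1)   # (B, C)
    z2 = F.normalize(f2.mean(dim=(2, 3)), dim=1)
    return -(z1 * z2).sum(dim=1).mean()


def dense_equivariance_loss(f1, f2, flow):
    """Equivariance: features of view 1, warped by the flow, should match view 2."""
    B, C, H, W = f1.shape
    # Build a sampling grid offset by the flow, normalized to [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
    offsets = torch.stack((flow[:, 0] * 2 / W, flow[:, 1] * 2 / H), dim=-1)
    warped = F.grid_sample(f1, base + offsets, align_corners=True)
    return F.mse_loss(F.normalize(warped, dim=1), F.normalize(f2, dim=1))


# Toy usage with random dense feature maps and a random pixel-space flow field.
B, C, H, W = 2, 64, 14, 14
f1, f2 = torch.rand(B, C, H, W), torch.rand(B, C, H, W)
flow = torch.randn(B, 2, H, W)                      # per-pixel (dx, dy)
loss = pooled_invariance_loss(f1, f2) + dense_equivariance_loss(f1, f2, flow)
print(loss.item())
```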
We formulate Osiris, a unifying framework for unsupervised continual learning (UCL), which disentangles learning objectives that encompass stability, plasticity, and cross-task consolidation.
Published: 2024-04-29
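As a schematic illustration of the decomposition, the sketch below writes a UCL loss as a weighted sum of a plasticity term on current-task views, a stability term on replayed memory samples, and a cross-task consolidation term linking current and past data. The specific similarity losses and weights are placeholders, not the exact Osiris objectives.

```python
# Schematic plasticity / stability / consolidation decomposition of a UCL loss.
import torch
import torch.nn.functional as F


def agreement_loss(z1, z2):
    """Generic SSL agreement term: negative cosine similarity between two embeddings."""
    return -(F.normalize(z1, dim=1) * F.normalize(z2, dim=1)).sum(dim=1).mean()


def ucl_loss(cur_v1, cur_v2, mem_v1, mem_v2, cur_z, mem_z, w_stab=1.0, w_cons=1.0):
    plasticity = agreement_loss(cur_v1, cur_v2)      # learn from current-task views
    stability = agreement_loss(mem_v1, mem_v2)       # preserve features on replayed data
    consolidation = agreement_loss(cur_z, mem_z)     # relate current and past representations
    return plasticity + w_stab * stability + w_cons * consolidation


# Toy usage with random embeddings standing in for encoder outputs.
B, D = 32, 128
loss = ucl_loss(
    torch.randn(B, D), torch.randn(B, D),   # two views of current-task samples
    torch.randn(B, D), torch.randn(B, D),   # two views of memory samples
    torch.randn(B, D), torch.randn(B, D),   # cross-task pairs (current vs. memory)
)
print(loss.item())
```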
We train self-supervised video models on longitudinal, egocentric headcam recordings collected from a child over a two-year period of early development.
Published: 2024-02-01
LifelongMemory is a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval.
Published: 2023-12-07
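A toy sketch of a retrieve-then-answer flow, assuming the long video has already been condensed into timestamped segment captions: a query retrieves the most relevant captions, which are assembled into a prompt for a language model. The bag-of-words scorer and prompt format are stand-ins, not the framework's actual captioning, retrieval, and LLM components.

```python
# Illustrative retrieval-then-QA flow over captioned video segments.
from collections import Counter
import math

# Hypothetical per-segment captions with timestamps (seconds into the video).
captions = [
    (120, "puts keys on the kitchen counter"),
    (340, "opens the fridge and takes out milk"),
    (900, "leaves the apartment and locks the door"),
]


def score(query, caption):
    """Cosine similarity of bag-of-words vectors (a stand-in for a learned text encoder)."""
    q, c = Counter(query.lower().split()), Counter(caption.lower().split())
    dot = sum(q[w] * c[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


def answer(query, k=2):
    """Retrieve the top-k relevant segments and assemble an LLM prompt from them."""
    top = sorted(captions, key=lambda tc: score(query, tc[1]), reverse=True)[:k]
    context = "\n".join(f"[t={t}s] {c}" for t, c in top)
    return f"Context from the video log:\n{context}\n\nQuestion: {query}\nAnswer:"


print(answer("where did I put my keys"))
```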