In this paper we introduce LifelongMemory, a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval. LifelongMemory generates concise video activity descriptions of the camera wearer and leverages the zero-shot capabilities of pretrained large language models to perform reasoning over long-form video context. Furthermore, LifelongMemory uses a confidence and explanation module to produce confident, high-quality, and interpretable answers. Our approach achieves state-of-the-art performance on the EgoSchema benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D.
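The pipeline can be summarized as: caption the egocentric video into concise activity descriptions, let a pretrained LLM reason over that textual memory zero-shot, and have it report a confidence level with an explanation. The following Python sketch illustrates this flow under stated assumptions; the names (`answer_query`, `llm`, the prompt format) are illustrative and not the authors' actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Answer:
    text: str
    confidence: str   # e.g. "high" / "low", from the confidence module
    explanation: str  # rationale returned alongside the answer


def answer_query(
    captions: List[str],        # concise per-clip activity descriptions of the camera wearer
    question: str,
    llm: Callable[[str], str],  # any pretrained LLM used zero-shot (prompt -> completion)
) -> Answer:
    # 1) Condense the long video into a textual memory the LLM can reason over.
    memory = "\n".join(f"[clip {i}] {c}" for i, c in enumerate(captions))

    # 2) Ask the LLM to answer and to report confidence with a brief explanation.
    prompt = (
        "You are given activity descriptions from a long egocentric video.\n"
        f"{memory}\n\n"
        f"Question: {question}\n"
        "Answer, then state your confidence (high/low) and a brief explanation, "
        "each on its own line."
    )
    raw = llm(prompt)

    # 3) Parse the reply; the exact format is an assumption of this sketch.
    fields = (raw.strip().split("\n") + ["", "", ""])[:3]
    return Answer(text=fields[0], confidence=fields[1], explanation=fields[2])
```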
```bibtex
@misc{wang2024lifelongmemory,
      title={LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos},
      author={Ying Wang and Yanlai Yang and Mengye Ren},
      year={2024},
      eprint={2312.05269},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```