Learning to Generate Long-term Future Narrations Describing Activities of Daily Living

📅 2025-03-03

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

This paper introduces the novel task of *long-horizon future narrative generation*, which aims to produce coherent, natural-language descriptions of daily activities occurring over the next several minutes, conditioned on egocentric video input—targeting applications in health monitoring, smart homes, and behavioral analysis. To address this, we propose ViNa, the first end-to-end vision–language model for this task, integrating long-sequence video encoding, cross-modal temporal alignment, and autoregressive narrative decoding. We further introduce *future video retrieval* as a new downstream application to enable interpretable, temporally grounded task planning visualization. Evaluated on the Ego4D dataset, ViNa substantially outperforms short-horizon prediction baselines, achieving state-of-the-art performance. Generated narratives exhibit high temporal consistency and activity plausibility, marking the first successful semantic modeling of minute-scale future behavior in realistic, everyday settings.

Technology Category

Application Category

📝 Abstract

Anticipating future events is crucial for various application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information, enhancing a system's future planning and decision-making capabilities. We propose a novel task: $ extit{long-term future narration generation}$, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce a visual-language model, ViNa, specifically designed to address this challenging task. ViNa integrates long-term videos and corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. ViNa extends existing multimodal models that perform only short-term predictions or describe observed videos by generating long-term future narrations for a broader range of daily activities. We also present a novel downstream application that leverages the generated narrations called future video retrieval to help users improve planning for a task by visualizing the future. We evaluate future narration generation on the largest egocentric dataset Ego4D.

Problem

Research questions and friction points this paper is trying to address.

Generates detailed narrations of future daily activities.

Extends beyond short-term predictions to long-term future narrations.

Enhances future planning and decision-making in various domains.

Innovation

Methods, ideas, or system contributions that make the work stand out.

ViNa model integrates long-term videos and narrations

Generates detailed future narrations for daily activities

Enables future video retrieval for task planning

🔎 Similar Papers

Large Language Models are Zero-Shot Recognizers for Activities of Daily Living