🤖 AI Summary
To address data scarcity, high human evaluation costs, and strict latency requirements in real-time, streaming vision-based task guidance, this paper introduces the first proactive AI assistant framework tailored to first-person (egocentric) videos. Methodologically, we (i) construct a large-scale synthetic multi-domain video-dialogue dataset; (ii) design an end-to-end streaming architecture integrating video chunk encoding, cross-modal alignment, dynamic context caching, and imbalance-aware learning; and (iii) establish a human-validated suite of automated evaluation metrics. Our contributions are threefold: (1) we release the first open-source synthetic first-person video-dialogue dataset; (2) our model achieves high response relevance and practical utility at under 300 ms end-to-end latency; and (3) our automated metrics agree strongly with human judgments (Spearman ρ > 0.87), significantly improving evaluation efficiency and reproducibility.
📝 Abstract
Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/