🤖 AI Summary
This work addresses two gaps. Existing proactive multimodal large language models lack sustained environmental awareness and personalization, and current evaluation benchmarks are confined to alert scenarios, neglect user context, and struggle to assess interaction timing. To bridge these gaps, we introduce EgoPro-Bench, a novel benchmark for proactive interaction grounded in streaming first-person (egocentric) video, spanning 12 domains with over 12,000 training videos and 2,400 evaluation videos. Using simulated user profiles, it generates diverse user intentions and high-fidelity human-machine interaction data. A "short thinking, better interaction" principle allocates a limited token budget for thinking before intent recognition, which improves interaction performance. By combining personalized intent modeling, streaming video processing, and a tailored evaluation protocol, our approach substantially improves the model's understanding of user intent and its accuracy in identifying opportune interaction moments, laying a foundation for user-centric proactive agents.
📝 Abstract
Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interactions (HMI). In this paper, we introduce EgoPro-Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set. Unlike previous works, EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct domains. Subsequently, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low-latency interaction on streaming video data, and conduct comprehensive evaluations. Furthermore, we introduce an interaction principle termed "short thinking, better interaction", which allocates a limited token budget prior to intent recognition, thereby enhancing interaction performance. The experiments demonstrate that EgoPro-Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next-generation user-centric proactive interactive agents.