🤖 AI Summary
Existing video benchmarks primarily focus on passive understanding and are ill-suited for evaluating the ability of multimodal large language models to provide real-time, interactive assistance for everyday tasks in dynamic real-world environments. To address this gap, this work proposes the first task-centric evaluation framework for real-time human-AI collaboration, grounded in continuous first-person video streams and natural dialogue. The authors construct a high-quality benchmark comprising 4,075 rigorously annotated samples spanning six core capability dimensions. A systematic evaluation of 26 state-of-the-art models reveals significant deficiencies in timeliness, effectiveness, and interactive adaptability, establishing a foundational benchmark for research on human-centered interactive intelligence in authentic everyday scenarios.
📝 Abstract
The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly assess passive understanding through retrospective analysis or isolated perception tasks, failing to capture the interactive and adaptive nature of real-time user assistance. To bridge this gap, we introduce LifeEval, a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life from an egocentric perspective. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogue. Constructed via a rigorous annotation pipeline, the benchmark comprises 4,075 high-quality question-answer pairs across six core capability dimensions. Extensive evaluations of 26 state-of-the-art MLLMs on LifeEval reveal substantial challenges in achieving timely, effective, and adaptive interaction, highlighting essential directions for advancing human-centered interactive intelligence.