🤖 AI Summary
Existing video benchmarks primarily focus on passive understanding and are ill-suited for evaluating the ability of multimodal large language models to provide real-time, interactive assistance for everyday tasks in dynamic real-world environments. To address this gap, this work proposes the first task-centric evaluation framework for real-time human-AI collaboration, grounded in continuous first-person video streams and natural dialogue. The authors construct a high-quality benchmark comprising 4,075 rigorously annotated samples spanning six core capability dimensions. A systematic evaluation of 26 state-of-the-art models reveals significant deficiencies in timeliness, effectiveness, and interactive adaptability, establishing a foundational benchmark for research on human-centered interactive intelligence in authentic everyday scenarios.
📝 Abstract
The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly assess passive understanding through retrospective analysis or isolated perception tasks, failing to capture the interactive and adaptive nature of real-time user assistance. To bridge this gap, we introduce LifeEval, a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life from an egocentric perspective. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogue. Constructed via a rigorous annotation pipeline, the benchmark comprises 4,075 high-quality question-answer pairs across six core capability dimensions. Extensive evaluations of 26 state-of-the-art MLLMs on LifeEval reveal substantial challenges in achieving timely, effective, and adaptive interaction, highlighting essential directions for advancing human-centered interactive intelligence.