🤖 AI Summary
Existing benchmarks predominantly focus on short-term, single-modal tasks, failing to adequately evaluate first-person AI assistants’ integrated capabilities—multimodal input fusion, real-time responsiveness, and long-term memory retention—in realistic streaming scenarios.
Method: We introduce the first streaming, full-modality benchmark for extended daily tasks, spanning work/study, routine life, social interaction, and cultural travel. We propose two novel metrics—Real-Time Accuracy and Memory Persistence Time—and design 12 diagnostic subtasks to jointly assess memory retention, cross-temporal understanding, and reasoning under a unified temporal framework. Input streams include synchronized first-person video, audio, and text, augmented with human-refined visual descriptions and ASR transcripts.
Contribution/Results: We release a high-quality dataset of 3,291 human-verified QA pairs, built on more than 14 hours of synchronized recordings per participant, enabling reproducible, fine-grained evaluation of embodied AI assistants in practical, temporally grounded settings.
📝 Abstract
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics, Real-Time Accuracy and Memory Persistence Time, to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
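To make the streaming constraint concrete, the sketch below shows one way a question posed at a given moment on the global timeline could be restricted to inputs that have already arrived. It is an illustration only, not the benchmark's actual harness: the `StreamItem` type, the `visible_context` helper, and the sample data are hypothetical names introduced here, and the abstract does not specify how either metric is computed, so neither is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamItem:
    """One timestamped input on the shared timeline (hypothetical schema)."""
    t: float        # seconds since the start of the recording
    modality: str   # "video", "audio", or "text"
    content: str    # e.g. a visual narration or a speech transcript


def visible_context(stream, ask_time):
    """Return, in time order, the items that arrived at or before ask_time.

    In a streaming setting the model may not look ahead, so anything
    timestamped after the question is excluded.
    """
    ordered = sorted(stream, key=lambda item: item.t)
    return [item for item in ordered if item.t <= ask_time]


# Made-up sample data, deliberately out of order.
stream = [
    StreamItem(30.0, "audio", "Let's meet at noon."),
    StreamItem(10.0, "video", "Participant opens a laptop."),
    StreamItem(95.0, "text", "Calendar reminder: lunch."),
]

# A question asked at t = 60 s can draw only on the first two items.
context = visible_context(stream, ask_time=60.0)
print([item.content for item in context])
```

Sorting by timestamp before filtering mirrors the idea of a single global timeline across modalities: video, audio, and text items are interleaved by arrival time rather than handled per modality.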