TeleEgo: Benchmarking Egocentric AI Assistants in the Wild

📅 2025-10-27
🤖 AI Summary
Problem: Existing benchmarks predominantly focus on short-term, single-modal tasks, so they fail to evaluate the integrated capabilities a first-person AI assistant needs in realistic streaming scenarios: multimodal input fusion, real-time responsiveness, and long-term memory retention. Method: We introduce the first streaming, full-modality benchmark for extended daily tasks, spanning work/study, routine life, social interaction, and cultural travel. We propose two novel metrics, Real-Time Accuracy and Memory Persistence Time, and design 12 diagnostic subtasks to jointly assess memory retention, cross-temporal understanding, and reasoning under a unified temporal framework. Input streams include synchronized first-person video, audio, and text, augmented with human-refined visual descriptions and ASR transcripts. Contribution/Results: We release a high-quality dataset with more than 14 hours of recordings per participant and 3,291 QA pairs, enabling reproducible, fine-grained evaluation of egocentric AI assistants in practical, temporally grounded settings.

📝 Abstract
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics, Real-Time Accuracy and Memory Persistence Time, to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
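
The abstract names the two metrics without defining them, so the following Python sketch shows only one plausible reading, not the paper's actual formulas. It assumes that Real-Time Accuracy is the fraction of questions answered correctly at the moment they are asked on the global timeline, and that Memory Persistence Time is the longest delay after which a re-asked question is still answered correctly. `QAItem`, `real_time_accuracy`, `memory_persistence_time`, `make_forgetful_model`, and the exact-match scoring are all hypothetical names and simplifications introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class QAItem:
    """One question placed on the global timeline (hypothetical schema)."""
    question: str
    answer: str
    ask_time: float  # seconds on the global timeline when the question is asked


# A model takes (question, current stream time) and may only use what it has
# observed up to that time. This interface is an assumption of the sketch.
Model = Callable[[str, float], str]


def real_time_accuracy(model: Model, items: List[QAItem]) -> float:
    """Fraction of items answered correctly at their ask time (assumed definition)."""
    if not items:
        return 0.0
    correct = sum(model(it.question, it.ask_time) == it.answer for it in items)
    return correct / len(items)


def memory_persistence_time(
    model: Model, item: QAItem, recheck_delays: List[float]
) -> Optional[float]:
    """Longest recheck delay (seconds) through which the answer stays correct.

    Returns None if the answer at ask time is already wrong, and 0.0 if it is
    right at ask time but wrong at the first recheck (assumed definition).
    """
    if model(item.question, item.ask_time) != item.answer:
        return None
    persisted = 0.0
    for delay in sorted(recheck_delays):
        if model(item.question, item.ask_time + delay) != item.answer:
            break  # stop at the first failure; later successes do not count
        persisted = delay
    return persisted


def make_forgetful_model(
    facts: Dict[str, Tuple[str, float]], retention: float
) -> Model:
    """Toy model: knows each fact from `seen_at` until `seen_at + retention`."""
    def model(question: str, now: float) -> str:
        answer, seen_at = facts[question]
        return answer if seen_at <= now <= seen_at + retention else "unknown"
    return model


# Toy data: the second question is asked before its evidence appears in the stream.
items = [
    QAItem("Where are the keys?", "desk", ask_time=100.0),
    QAItem("Who called?", "Alice", ask_time=50.0),
]
facts = {
    "Where are the keys?": ("desk", 90.0),
    "Who called?": ("Alice", 60.0),
}
model = make_forgetful_model(facts, retention=600.0)

print(real_time_accuracy(model, items))                           # 0.5
print(memory_persistence_time(model, items[0], [60, 300, 900]))   # 300
print(memory_persistence_time(model, items[1], [60, 300, 900]))   # None
```

Exact string matching suits only the single-choice and binary formats in this sketch; multi-choice and open-ended items would need a different scoring rule, which the abstract does not specify.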
Problem

Research questions and friction points this paper is trying to address.

Evaluating egocentric AI assistants in realistic daily streaming scenarios
Assessing multi-modal memory, understanding, and cross-memory reasoning capabilities
Measuring real-time accuracy and long-term memory persistence in assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces TeleEgo, a benchmark that tests egocentric AI assistants
Uses synchronized multi-modal real-world data streams
Measures real-time accuracy and long-term memory persistence
Jiaqi Yan
Institute of Artificial Intelligence (TeleAI), China Telecom
Ruilong Ren
Institute of Artificial Intelligence (TeleAI), China Telecom
Jingren Liu
PhD student, Tianjin University
Continual Learning, Long-form Video Understanding, Unified Models
Shuning Xu
Institute of Artificial Intelligence (TeleAI), China Telecom
Ling Wang
Institute of Artificial Intelligence (TeleAI), China Telecom
Yiheng Wang
Institute of Artificial Intelligence (TeleAI), China Telecom
Yun Wang
Institute of Artificial Intelligence (TeleAI), China Telecom
Long Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Xiangyu Chen
Institute of Artificial Intelligence (TeleAI), China Telecom
Changzhi Sun
Institute of Artificial Intelligence (TeleAI), China Telecom
Machine Learning, Natural Language Processing, AI for Science
Jixiang Luo
SenseTime
Data Compression, Video Coding, Signal Processing
Dell Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Machine Learning, Information Retrieval, Natural Language Processing
Hao Sun
Institute of Artificial Intelligence (TeleAI), China Telecom
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom