EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

📅 2026-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited robustness of egocentric "Talking to Me" (TTM) speaker detection under missing visual modalities, neglected head orientation, and background noise interference. To this end, the authors propose EgoAdapt, an adaptive framework that, for the first time, incorporates head orientation as a critical nonverbal cue in the TTM task. The method fuses multimodal signals, including lip motion and head orientation, and introduces a Visual Modality Missing-Aware (VMMA) mechanism alongside a Parallel Shared-weight Audio (PSA) encoder to adapt dynamically to missing modalities and improve noise robustness. Evaluated on the Ego4D dataset, the proposed approach achieves 67.39% mAP and 62.01% accuracy, outperforming the current state of the art by 1.56% and 4.96%, respectively.
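To make the missing-aware adaptation concrete, the sketch below shows one way a VMMA-style gate could work: a small network estimates a per-frame presence score for the visual stream and blends visual and audio features accordingly. The class name, feature shapes, and gating rule here are illustrative assumptions for a runnable example, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MissingAwareGate(nn.Module):
    """Hypothetical VMMA-style gate: estimates a per-frame presence
    score for the visual stream and blends visual features with audio
    features so the model can fall back to audio when frames are missing."""

    def __init__(self, dim: int):
        super().__init__()
        self.presence = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, time, dim), already aligned in time.
        p = self.presence(visual_feat)              # (batch, time, 1)
        # Weight visual frames by their estimated presence; frames judged
        # missing (p near 0) are replaced by the audio representation.
        return p * visual_feat + (1.0 - p) * audio_feat
```

A soft gate like this degrades gracefully: instead of a hard drop when the visual stream vanishes, the fused representation slides toward the audio-only one frame by frame.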

📝 Abstract
The TTM (Talking to Me) task is a pivotal component of understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often struggle in real-world scenarios due to missing visual data, neglect of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals for TTM and setting it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing-Aware (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response dynamically. Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.
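As a rough illustration of the PSA idea of shared-weight parallel audio branches, the following sketch runs one encoder over two views of the audio and averages the results. The choice of views (e.g. raw vs. noise-suppressed features) and the GRU backbone are assumptions made for the sake of a self-contained example, not the paper's design.

```python
import torch
import torch.nn as nn

class ParallelSharedAudioEncoder(nn.Module):
    """Illustrative PSA-style encoder: the SAME GRU (shared weights) is
    applied to two parallel views of the audio, and the branch outputs
    are averaged. Views and backbone are assumptions, not the authors'
    architecture."""

    def __init__(self, in_dim: int = 80, hidden: int = 256):
        super().__init__()
        # One encoder instance, applied to both views => shared weights.
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
        # view_a / view_b: (batch, time, in_dim) feature frames, e.g.
        # log-mels of the raw signal and of a denoised variant.
        feat_a, _ = self.encoder(view_a)
        feat_b, _ = self.encoder(view_b)
        return 0.5 * (feat_a + feat_b)  # shared-weight branches, averaged
```

Sharing weights across branches keeps both views in a single embedding space, so the averaged features stay consistent even when one view is heavily corrupted by noise.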
Problem

Research questions and friction points this paper is trying to address.

Egocentric
Speaker Detection
Missing Modalities
Talking to Me
Robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

EgoAdapt
modality missing
head orientation
audio-visual fusion
egocentric speaker detection
Xinyuan Qian
Associate Professor, University of Science and Technology Beijing, China
speech processing, multimedia, human-robot interaction
Xinjia Zhu
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China
Alessio Brutti
Fondazione Bruno Kessler (FBK), Italy
audio/speech processing
Dong Liang
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Shenzhen Research Institute, Nanjing University of Aeronautics and Astronautics, China