EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

📅 2026-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited robustness of egocentric "Talking to Me" (TTM) speaker detection under missing visual modalities, neglected head orientation, and background noise interference. To this end, the authors propose EgoAdapt, an adaptive framework that, for the first time, incorporates head orientation as a critical nonverbal cue in the TTM task. The method fuses multimodal signals, including lip motion and head orientation, and introduces a Visual Modality Missing-Aware (VMMA) mechanism alongside a Parallel Shared-weight Audio (PSA) encoder to adapt dynamically to missing modalities and improve noise robustness. Evaluated on the Ego4D dataset, the proposed approach achieves 67.39% mAP and 62.01% accuracy, outperforming the current state of the art by 1.56% and 4.96%, respectively.
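To make the missing-aware adaptation concrete, the sketch below shows one way a VMMA-style gate could work: a small network estimates a per-frame presence score for the visual stream and blends visual and audio features accordingly. The class name, feature shapes, and gating rule here are illustrative assumptions for a runnable example, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MissingAwareGate(nn.Module):
    """Hypothetical VMMA-style gate: estimates a per-frame presence
    score for the visual stream and blends visual features with audio
    features so the model can fall back to audio when frames are missing."""

    def __init__(self, dim: int):
        super().__init__()
        self.presence = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, time, dim), already aligned in time.
        p = self.presence(visual_feat)              # (batch, time, 1)
        # Weight visual frames by their estimated presence; frames judged
        # missing (p near 0) are replaced by the audio representation.
        return p * visual_feat + (1.0 - p) * audio_feat
```

A soft gate like this degrades gracefully: instead of a hard drop when the visual stream vanishes, the fused representation slides toward the audio-only one frame by frame.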

📝 Abstract
The TTM (Talking to Me) task is a pivotal component of understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often struggle in real-world scenarios due to missing visual data, neglect of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals for TTM and setting it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing-Aware (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response dynamically. Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.
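As a rough illustration of the PSA idea of shared-weight parallel audio branches, the following sketch runs one encoder over two views of the audio and averages the results. The choice of views (e.g. raw vs. noise-suppressed features) and the GRU backbone are assumptions made for the sake of a self-contained example, not the paper's design.

```python
import torch
import torch.nn as nn

class ParallelSharedAudioEncoder(nn.Module):
    """Illustrative PSA-style encoder: the SAME GRU (shared weights) is
    applied to two parallel views of the audio, and the branch outputs
    are averaged. Views and backbone are assumptions, not the authors'
    architecture."""

    def __init__(self, in_dim: int = 80, hidden: int = 256):
        super().__init__()
        # One encoder instance, applied to both views => shared weights.
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
        # view_a / view_b: (batch, time, in_dim) feature frames, e.g.
        # log-mels of the raw signal and of a denoised variant.
        feat_a, _ = self.encoder(view_a)
        feat_b, _ = self.encoder(view_b)
        return 0.5 * (feat_a + feat_b)  # shared-weight branches, averaged
```

Sharing weights across branches keeps both views in a single embedding space, so the averaged features stay consistent even when one view is heavily corrupted by noise.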
Problem

Research questions and friction points this paper is trying to address.

Egocentric
Speaker Detection
Missing Modalities
Talking to Me
Robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

EgoAdapt
modality missing
head orientation
audio-visual fusion
egocentric speaker detection
Xinyuan Qian
Associate Professor, University of Science and Technology Beijing, China
speech processing, multimedia, human-robot interaction
Xinjia Zhu
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China
Alessio Brutti
Fondazione Bruno Kessler (FBK), Italy
audio/speech processing
Dong Liang
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Shenzhen Research Institute, Nanjing University of Aeronautics and Astronautics, China