EgoAVU: Egocentric Audio-Visual Understanding

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models struggle to jointly integrate visual and audio semantics in first-person (egocentric) video, largely because high-quality cross-modal aligned text labels are scarce. To address this limitation, this work proposes EgoAVU, a data engine that, for the first time, enables scalable, automatic generation of egocentric audio-visual narrations, questions, and answers. Combining cross-modal correlation modeling, token-based video filtering, and modular graph-based curation, the framework constructs both the EgoAVU-Instruct training set and the EgoAVU-Bench evaluation benchmark. Experimental results show that models fine-tuned on this data achieve up to a 113% performance gain on EgoAVU-Bench and up to a 28% relative improvement on external benchmarks such as EgoTempo and EgoIllusion, significantly reducing the models' over-reliance on the visual modality.
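
As a rough reading of the reported numbers, the gains above appear to be relative improvements, i.e. (score_after − score_before) / score_before. The snippet below is only an illustrative sketch of that arithmetic; the accuracy values used are made-up placeholders, not figures from the paper.

```python
# Illustrative sketch of how a relative performance gain is computed.
# The accuracy values below are hypothetical placeholders, NOT numbers
# reported in the paper.

def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before

# Hypothetical example: a model scoring 20.0 before fine-tuning and
# 42.6 after would show a ~113% relative gain.
baseline_acc = 20.0
finetuned_acc = 42.6
print(f"relative gain: {relative_gain(baseline_acc, finetuned_acc):.0%}")  # ~113%
```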

📝 Abstract
Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Fine-tuning MLLMs on EgoAVU-Instruct effectively addresses this issue, yielding up to a 113% performance improvement on EgoAVU-Bench. These benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, with up to a 28% relative performance gain. Code will be released to the community.
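
To make the pipeline described in the abstract more concrete, the sketch below outlines three of its stages (narration enrichment, a toy stand-in for cross-modal correlation modeling, and token-based video filtering) in plain Python; the graph-based curation step is omitted. All class names, fields, and heuristics here are hypothetical illustrations under our own assumptions, not the paper's actual implementation.

```python
# Minimal, hypothetical sketch of an EgoAVU-style data engine.
# Every class, field, and heuristic below is an assumption made for
# illustration; it is not the paper's released code.
from dataclasses import dataclass


@dataclass
class Clip:
    video_id: str
    narration: str        # human-written visual narration
    audio_tags: list      # coarse audio event labels
    visual_tokens: set    # detected objects/actions


def enrich_narration(clip: Clip) -> str:
    """Enrich the human narration with multimodal (audio) context."""
    return f"{clip.narration} [audio: {', '.join(clip.audio_tags)}]"


def audio_visual_narration(clip: Clip) -> str:
    """Toy stand-in for cross-modal correlation modeling: keep only audio
    events that plausibly relate to something visible in the clip."""
    correlated = [t for t in clip.audio_tags
                  if any(obj in t for obj in clip.visual_tokens)]
    heard = ', '.join(correlated) or 'background noise'
    return f"{clip.narration} The sound of {heard} is heard."


def token_filter(clips, seen_tokens, min_new=1):
    """Toy token-based video filtering: keep clips that contribute new
    visual tokens, as a stand-in for a diversity criterion."""
    kept = []
    for clip in clips:
        new = clip.visual_tokens - seen_tokens
        if len(new) >= min_new:
            kept.append(clip)
            seen_tokens |= new
    return kept


if __name__ == "__main__":
    clips = [
        Clip("v1", "C chops an onion on the board.",
             ["knife chopping", "music"], {"knife", "onion", "board"}),
        Clip("v2", "C chops a carrot on the board.",
             ["knife chopping"], {"knife", "carrot", "board"}),
    ]
    for clip in token_filter(clips, set()):
        print(enrich_narration(clip))
        print(audio_visual_narration(clip))
```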
Problem

Research questions and friction points this paper is trying to address.

egocentric video
audio-visual understanding
multimodal large language models
cross-modal correlation
embodied intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Egocentric audio-visual understanding
Multimodal large language models
Cross-modal correlation modeling
Data curation
Audio-visual alignment
🔎 Similar Papers
No similar papers found.