Exploring Audio Hallucination in Egocentric Video Understanding

📅 2026-04-26

🤖 AI Summary

This work addresses audio hallucination in audio-visual large language models (AV-LLMs) that process first-person videos: the models infer sounds that are absent from the audio, based solely on visual cues. The authors propose a fine-grained taxonomy of audio hallucination in this setting that distinguishes foreground action sounds from background environmental sounds. They also introduce a benchmark of 300 videos and 1,000 sound-oriented question-answer pairs. Combining an automated evaluation protocol, grounded taxonomy-based classification, and human validation, they systematically assess state-of-the-art AV-LLMs, including Qwen2.5-Omni, and report accuracies of only 27.3% and 39.5% on foreground and background sound tasks, respectively, which points to severe multimodal perception deficiencies in current models.

📝 Abstract

Egocentric videos provide a distinctive setting in which sound serves as a crucial cue for understanding user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues alone, for objects or actions that are seen but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds produced by the user's activities and background ambient sounds. Our evaluation shows that advanced AV-LLMs, such as Qwen2.5-Omni, exhibit high hallucination rates, achieving only 27.3% and 39.5% accuracy on Q/As related to foreground and background sounds, respectively. With this work, we highlight the need to measure the reliability of multimodal responses, emphasizing that robust evaluation of hallucinations is essential to develop reliable AV-LLMs.
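
To make the reported metric concrete, the following is a minimal sketch of how per-category accuracy over sound-focused Q/A pairs could be computed. It is not the authors' evaluation code. The `QAResult` record, the yes/no answer format, and the `hallucination_rate` helper are assumptions introduced for illustration; only the two category names (foreground, background) come from the taxonomy described in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAResult:
    """One scored question (hypothetical record layout, not from the paper)."""
    category: str   # "foreground" or "background", following the taxonomy
    expected: str   # ground-truth answer; assumed to be "yes" or "no"
    predicted: str  # model answer, already reduced to "yes" or "no"


def _norm(answer: str) -> str:
    # Compare answers case-insensitively and ignore surrounding whitespace.
    return answer.strip().lower()


def accuracy_by_category(results):
    """Return {category: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.category] += 1
        if _norm(r.predicted) == _norm(r.expected):
            correct[r.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


def hallucination_rate(results):
    """Fraction of absent-sound questions (expected "no") answered "yes".

    This is one assumed way to quantify hallucination; the paper may
    define it differently.
    """
    absent = [r for r in results if _norm(r.expected) == "no"]
    if not absent:
        return 0.0
    hallucinated = sum(_norm(r.predicted) == "yes" for r in absent)
    return hallucinated / len(absent)
```

Reporting accuracy separately per category, rather than as a single pooled number, mirrors how the abstract presents its results for foreground and background sounds.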

Problem

Research questions and friction points this paper is trying to address.

audio hallucination

egocentric video

audio-visual language models

multimodal reliability

sound perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio hallucination

egocentric video

audio-visual language models