AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Omni-modal large language models struggle to perceive audio-visual inconsistencies in long-form videos, and existing benchmarks do not target this ability. This work introduces AVID, the first large-scale benchmark for audio-visual inconsistency understanding, built with a scalable agent-driven pipeline that injects cross-modal conflicts from eight fine-grained categories into long videos. The pipeline combines temporal segmentation, an agent-driven strategy planner, and five specialized injectors, and the benchmark supports detection, temporal grounding, classification, and reasoning tasks. A fine-tuned baseline, AVID-Qwen, achieves 2.8× higher BLEU-4 than its base model in segment reasoning and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15).

📝 Abstract

We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks predominantly focus on aligned events or deepfake detection, leaving a significant gap in evaluating inconsistency perception in long-form video contexts. AVID addresses this with: (1) a scalable construction pipeline comprising temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; an agent-driven strategy planner that selects semantically appropriate inconsistency categories; and five specialized injectors for diverse audio-visual conflict injection; (2) 11.2K long videos (avg. 235.5s) with 39.4K annotated inconsistency events and 78.7K segment clips, supporting evaluation across detection, temporal grounding, classification, and reasoning with 8 fine-grained inconsistency categories. Comprehensive evaluations of state-of-the-art omni-models reveal significant limitations in temporal grounding and reasoning. Our fine-tuned baseline, AVID-Qwen, achieves substantial improvements over the base model (2.8× higher BLEU-4 in segment reasoning) and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15), validating AVID as an effective testbed for advancing trustworthy omni-modal AI systems.
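
The temporal grounding score quoted above is a mean Intersection over Union (mIoU) between predicted and annotated time spans. The abstract does not describe the exact matching protocol, so the following is only a minimal sketch that assumes each prediction is paired with exactly one ground-truth span, both given as (start, end) in seconds. The names `temporal_iou` and `mean_iou` are illustrative and do not come from the paper.

```python
from typing import List, Tuple

# A time span as (start_seconds, end_seconds); this layout is an assumption.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over Union of two 1-D time intervals."""
    # Overlap length, clamped at zero when the spans are disjoint.
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: List[Interval], gts: List[Interval]) -> float:
    """Average IoU over index-aligned (prediction, ground truth) pairs."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # Hypothetical spans for illustration only, not data from AVID.
    predictions = [(10.0, 20.0), (0.0, 5.0)]
    ground_truth = [(15.0, 25.0), (10.0, 12.0)]
    print(f"mIoU = {mean_iou(predictions, ground_truth):.3f}")
```

Under this pairing assumption, a prediction that misses the annotated span entirely contributes zero to the average, so a model that detects an inconsistency but places it at the wrong time is penalized as heavily as one that misses it.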

Problem

Research questions and friction points this paper is trying to address.

audio-visual inconsistency

omni-modal models

video understanding

cross-modal conflict

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual inconsistency

omni-modal benchmark

agent-driven construction