XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System

📅 2025-09-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from inaccurate temporal alignment, inefficient key-segment retrieval in audiovisual understanding, and a lack of quality-aware evaluation capability, particularly in scenarios mixing real and AI-generated (AIGC) audiovisual content. Method: We propose XGC-AVis, a four-stage perception–planning–execution–reflection multi-agent collaborative framework that improves cross-modal temporal localization and joint reasoning over synchronization, coherence, and generation quality, without additional training. We further introduce XGC-AVQuiz, the first benchmark covering both real and AI-generated audiovisual content, which adds quality-aware assessment and fine-grained temporal alignment as novel evaluation dimensions. Contribution/Results: Experiments demonstrate that XGC-AVis significantly improves MLLMs' performance on temporal alignment and multidimensional quality assessment tasks, while exposing critical bottlenecks in perception–cognition coordination within current models.

📝 Abstract
In this paper, we propose XGC-AVis, a multi-agent framework that enhances the audio-video temporal alignment capabilities of multimodal large models (MLLMs) and improves the efficiency of retrieving key video segments through 4 stages: perception, planning, execution, and reflection. We further introduce XGC-AVQuiz, the first benchmark aimed at comprehensively assessing MLLMs' understanding capabilities in both real-world and AI-generated scenarios. XGC-AVQuiz consists of 2,685 question-answer pairs across 20 tasks, with two key innovations: 1) AIGC Scenario Expansion: The benchmark includes 2,232 videos, comprising 1,102 professionally generated content (PGC), 753 user-generated content (UGC), and 377 AI-generated content (AIGC). These videos cover 10 major domains and 53 fine-grained categories. 2) Quality Perception Dimension: Beyond conventional tasks such as recognition, localization, and reasoning, we introduce a novel quality perception dimension. This requires MLLMs to integrate low-level sensory capabilities with high-level semantic understanding to assess audio-visual quality, synchronization, and coherence. Experimental results on XGC-AVQuiz demonstrate that current MLLMs struggle with quality perception and temporal alignment tasks. XGC-AVis improves these capabilities without requiring additional training, as validated on two benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing audio-video temporal alignment in multimodal large models
Improving efficiency of retrieving key video segments through multi-agent collaboration
Assessing model understanding across real-world and AI-generated content scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework enhances audio-video temporal alignment
Four-stage process: perception, planning, execution, reflection
Improves capabilities without requiring additional model training
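The four-stage loop above can be sketched as a minimal agent cycle. This is a hypothetical illustration, not the paper's implementation: the stage names follow XGC-AVis, but all class names, sub-tasks, and function bodies are placeholder assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Shared state passed between the four stages (illustrative)."""
    query: str
    observations: list = field(default_factory=list)
    plan: list = field(default_factory=list)
    answer: str = ""

def perceive(state: AgentState) -> None:
    # Stage 1: extract coarse audio/video descriptions (placeholder).
    state.observations.append(f"clip features for: {state.query}")

def make_plan(state: AgentState) -> None:
    # Stage 2: decompose the query into retrieval sub-tasks (placeholder).
    state.plan = ["locate key segment", "check A/V sync", "rate quality"]

def execute(state: AgentState) -> None:
    # Stage 3: run each sub-task and draft an answer (placeholder).
    state.answer = "; ".join(f"done: {task}" for task in state.plan)

def reflect(state: AgentState) -> bool:
    # Stage 4: accept the draft only if every sub-task was addressed.
    return all(task in state.answer for task in state.plan)

def run(query: str, max_rounds: int = 3) -> str:
    """Iterate perception -> planning -> execution -> reflection."""
    state = AgentState(query=query)
    for _ in range(max_rounds):
        perceive(state)
        make_plan(state)
        execute(state)
        if reflect(state):
            break  # reflection accepted the answer
    return state.answer
```

The reflection stage acting as a retry gate is what lets such a framework improve answers at inference time without any extra model training.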
Yuqin Cao
Shanghai Jiao Tong University
Xiongkuo Min
Shanghai Jiao Tong University
Yixuan Gao
Shanghai Jiao Tong University
Wei Sun
East China Normal University
Zicheng Zhang
Shanghai Jiao Tong University, Shanghai AI Laboratory
Jinliang Han
Shanghai Jiao Tong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing · Visual Quality Assessment · QoE · AI Evaluation · Displays