Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual question answering (AVQA) methods struggle to model structured semantics in videos, such as objects and their spatiotemporal relationships, and lack robust cross-modal temporal reasoning capabilities. To address this, we propose the first multi-modal scene graph (MSG) representation framework tailored for AVQA, which explicitly encodes visual objects, auditory sources, and their spatiotemporal relations. We further introduce a Kolmogorov-Arnold Network (KAN)-driven Mixture-of-Experts (MoE) architecture that enables question-aware, fine-grained cross-modal fusion and dynamic temporal modeling. Our approach achieves state-of-the-art performance on the MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, significantly improving QA accuracy in complex multimodal scenarios. By unifying structural scene understanding with adaptive temporal reasoning, this work establishes a novel paradigm for structured multimodal reasoning.

📝 Abstract
In this paper, we propose SHRIKE, a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering. The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, the main challenge being the identification of question-relevant cues in complex audio-visual content. Existing methods fail to capture the structural information within videos and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a multi-modal scene graph that explicitly models objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables finer-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, allowing the model to capture richer, more nuanced patterns and thus improve temporal reasoning. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.
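The KAN-based MoE for temporal integration can be pictured with a short sketch. The code below is a minimal PyTorch illustration, not the authors' released implementation: it approximates KAN's learnable edge functions with a Gaussian radial-basis expansion (in the spirit of FastKAN-style approximations), and the class names, dimensions, and softmax gating scheme are all assumptions for illustration.

```python
# A minimal sketch (assumed names and hyperparameters, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class KANLayer(nn.Module):
    """Kolmogorov-Arnold layer: learnable univariate functions on edges,
    approximated here by a Gaussian radial-basis expansion per input
    coordinate plus a SiLU residual base path."""
    def __init__(self, d_in, d_out, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*grid_range, num_basis))
        self.inv_width = num_basis / (grid_range[1] - grid_range[0])
        self.spline = nn.Linear(d_in * num_basis, d_out)  # spline coefficients
        self.base = nn.Linear(d_in, d_out)                # residual base path

    def forward(self, x):                                  # x: (..., d_in)
        # RBF basis for every input coordinate: (..., d_in, num_basis)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) * self.inv_width) ** 2)
        phi = phi.flatten(start_dim=-2)                    # (..., d_in * K)
        return self.spline(phi) + self.base(F.silu(x))

class KANMoE(nn.Module):
    """Question-aware mixture of KAN experts: a gate conditioned on the
    fused audio-visual-question feature softly routes among experts."""
    def __init__(self, d_model, num_experts=4, num_basis=8):
        super().__init__()
        self.experts = nn.ModuleList(
            KANLayer(d_model, d_model, num_basis) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, fused):                              # fused: (B, T, d)
        weights = F.softmax(self.gate(fused), dim=-1)      # (B, T, E)
        expert_out = torch.stack([e(fused) for e in self.experts], dim=-1)
        return (expert_out * weights.unsqueeze(-2)).sum(-1)  # (B, T, d)

# Usage: route 60 fused temporal steps of 512-d features through 4 experts.
x = torch.randn(2, 60, 512)
print(KANMoE(512)(x).shape)  # torch.Size([2, 60, 512])
```

The soft (dense) gating here is one plausible reading of "question-aware expert fusion"; a sparse top-k router would be an equally valid variant.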
Problem

Research questions and friction points this paper is trying to address.

Extracting and fusing audio-visual cues for question answering.
Modeling structured relationships in audio-visual scenes.
Enhancing fine-grained cross-modal interaction modeling.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal scene graph that explicitly models object relationships (see the sketch after this list).
KAN-based Mixture of Experts that enhances the expressiveness of temporal integration.
Fine-grained cross-modal interaction modeling that improves temporal reasoning.
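To make the first point concrete, here is a minimal sketch of what a multi-modal scene graph could look like as a data structure: nodes for visual objects and audio sources, joined by spatiotemporal and grounding relations. The field names (modality, bbox, relation) and all labels are hypothetical, not the paper's actual schema.

```python
# A minimal, hypothetical multi-modal scene graph container.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: int
    modality: str                 # "visual" object or "audio" source
    label: str                    # e.g. "guitar", "guitar_sound"
    frame: int                    # temporal index of the video segment
    bbox: Optional[tuple] = None  # (x1, y1, x2, y2) for visual nodes only

@dataclass
class Edge:
    src: int                      # source node_id
    dst: int                      # destination node_id
    relation: str                 # e.g. "left_of", "produces"

@dataclass
class MultiModalSceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes.append(node)

    def link(self, src: int, dst: int, relation: str) -> None:
        self.edges.append(Edge(src, dst, relation))

# Usage: ground a guitar seen in frame 3 to the sound it produces.
g = MultiModalSceneGraph()
g.add_node(Node(0, "visual", "guitar", frame=3, bbox=(10, 40, 120, 200)))
g.add_node(Node(1, "audio", "guitar_sound", frame=3))
g.link(0, 1, "produces")
print(len(g.nodes), len(g.edges))  # -> 2 1
```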
👥 Authors
Zijian Fu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Changsheng Lv
Beijing University of Posts and Telecommunications (Scene Graph Generation, Autonomous Driving)
Mengshi Qi
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Huadong Ma
Beijing University of Posts and Telecommunications (Internet of Things, Multimedia)