OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large models face bottlenecks in fine-grained audiovisual understanding and cross-modal alignment. To address this, we propose an audio-driven active perception agent that shifts from passive comprehension to proactive, task-oriented cross-modal interrogation—enabling dynamic querying beyond static frame description. Our method introduces a “coarse-to-fine” audio-guided perception framework and a dynamic tool orchestration mechanism integrating audio event localization, multi-stage attention focusing, and audiovisual collaborative reasoning. This design achieves modality-adaptive alignment and precise spatiotemporal event localization. Evaluated on three major audiovisual understanding benchmarks, our approach achieves state-of-the-art performance, outperforming the best open-source and closed-source models by 10–20% in accuracy. The results demonstrate substantial improvements in fine-grained, joint audiovisual reasoning capability.

📝 Abstract
Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10–20% in accuracy.
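The coarse-to-fine audio-guided loop described in the abstract can be sketched in a few lines. This is an illustrative outline only, not the paper's implementation: all function names, tool signatures, and the returned event data are hypothetical stand-ins for the audio event localizer and the targeted visual inspection tools the agent would orchestrate.

```python
# Hypothetical sketch of coarse-to-fine, audio-guided active perception.
# All tool names and event data below are illustrative assumptions,
# not the paper's actual API.

def localize_audio_events(audio):
    """Coarse stage: detect salient audio events with rough timestamps.
    Stand-in for a real audio event detector; returns (label, start, end)."""
    return [("dog_bark", 3.0, 4.5), ("speech", 10.0, 14.0)]

def sample_frames(video, start, end, n=4):
    """Fine stage: sample frames only inside the audio-localized window,
    instead of densely captioning the whole video."""
    step = (end - start) / max(n - 1, 1)
    return [round(start + i * step, 2) for i in range(n)]

def answer_question(video, audio, question):
    """Agent loop: audio cues first, then targeted visual inspection."""
    evidence = []
    for label, start, end in localize_audio_events(audio):
        frames = sample_frames(video, start, end)
        # A real agent would invoke a vision tool on these frames and
        # reason over the joint evidence; here we only record which
        # temporal windows were actively queried.
        evidence.append({"event": label, "frames": frames})
    return evidence

print(answer_question("video.mp4", "audio.wav", "What made the noise?"))
```

The key design point the sketch tries to convey is that perception is demand-driven: visual tools are invoked only on the temporal windows that audio events flag as task-relevant.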
Problem

Research questions and friction points this paper is trying to address.

How to achieve fine-grained audio-visual reasoning beyond dense frame-captioning
How to shift from passive response generation to active multimodal inquiry
How to localize temporal events in video using audio cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic tool orchestration for audio-visual reasoning
Coarse-to-fine audio-guided perception paradigm
Active multimodal inquiry replacing passive response generation
Keda Tao, Westlake University (Generative Model · Computer Vision · MLLM)
Wenjie Du, Westlake University
Bohan Yu, Ant Group
Weiqiang Wang, Ant Group
Jian Liu, Ant Group
Huan Wang, Westlake University