Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multi-hop audio-visual reasoning, where the relevant evidence is sparse, temporally dispersed, and spread across audio and visual streams, a setting in which current Omni-LLMs still struggle. The authors introduce MOV-Bench, a benchmark of 519 curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence, and propose AOP-Agent, an agentic framework for active omni-modal perception built on open-source Omni-LLMs. AOP-Agent combines a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop and requires neither additional training nor proprietary models. On MOV-Bench and OmniVideoBench it consistently improves reasoning performance, with the most notable gains on long videos and reasoning-intensive questions.
📝 Abstract
Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.
Problem

Research questions and friction points this paper is trying to address.

multi-hop reasoning
audio-visual perception
Omni-LLMs
temporal dispersion
cross-modal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic active perception
multi-hop audio-visual reasoning
omni-modal memory
observe-reflect-replan loop
Omni-LLMs
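
To make the observe-reflect-replan idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: every name in it (HierarchicalMemory, run_agent, observe, reflect, replan, TOY_VIDEO) is hypothetical, the two-level memory layout is an assumption, and simple keyword overlap stands in for the Omni-LLM calls that a real system would make at each step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Segment = Tuple[float, float]  # (start_s, end_s); hypothetical representation


@dataclass
class HierarchicalMemory:
    """Assumed two-level store: coarse summaries per segment, fine notes on demand."""
    coarse: Dict[Segment, str] = field(default_factory=dict)
    fine: Dict[Segment, List[str]] = field(default_factory=dict)

    def add_fine(self, seg: Segment, note: str) -> None:
        self.fine.setdefault(seg, []).append(note)

    def evidence(self) -> List[str]:
        # All fine-grained notes, in the order the segments were inspected.
        return [note for notes in self.fine.values() for note in notes]


def run_agent(
    question: str,
    memory: HierarchicalMemory,
    observe: Callable[[Segment, str], str],
    reflect: Callable[[str, List[str]], Optional[str]],
    replan: Callable[[str, HierarchicalMemory], List[Segment]],
    max_steps: int = 8,
) -> Optional[str]:
    plan = replan(question, memory)              # initial plan from coarse memory only
    for _ in range(max_steps):
        if not plan:
            break
        seg = plan.pop(0)
        memory.add_fine(seg, observe(seg, question))     # observe: inspect one segment
        answer = reflect(question, memory.evidence())    # reflect: is the evidence enough?
        if answer is not None:
            return answer
        plan = replan(question, memory)          # replan: re-rank the unseen segments
    return None


# Toy stand-ins for model calls (all hypothetical).
TOY_VIDEO = {
    (0.0, 30.0): "audio: narrator says the red box holds the key",
    (30.0, 60.0): "visual: a dog runs across the yard",
    (60.0, 90.0): "visual: the red box is on the kitchen shelf",
}


def observe(seg: Segment, question: str) -> str:
    return TOY_VIDEO[seg]


def reflect(question: str, evidence: List[str]) -> Optional[str]:
    text = " ".join(evidence)
    # Both hops are needed: which object holds the key, and where that object is.
    if "holds the key" in text and "kitchen shelf" in text:
        return "kitchen shelf"
    return None


def replan(question: str, memory: HierarchicalMemory) -> List[Segment]:
    # Query words come from the question plus everything observed so far.
    query = set(question.lower().split())
    for note in memory.evidence():
        query |= set(note.lower().split())
    unseen = [s for s in memory.coarse if s not in memory.fine]
    # Rank unseen segments by word overlap with their coarse summary.
    return sorted(unseen, key=lambda s: -len(query & set(memory.coarse[s].lower().split())))


memory = HierarchicalMemory(coarse={
    (0.0, 30.0): "narration about a key",
    (30.0, 60.0): "dog outdoors",
    (60.0, 90.0): "kitchen with a red box",
})
answer = run_agent("where is the key", memory, observe, reflect, replan)
visited = list(memory.fine)
```

In this toy run the first observation (audio) mentions a red box, so replanning promotes the segment whose coarse summary mentions it and the middle segment is never inspected. That is the sense in which the perception is "active": what has been heard or seen so far decides where to look next, instead of scanning the whole video uniformly.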