Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multi-hop audio-visual reasoning, where relevant evidence is sparse, temporally dispersed, and spread across the audio and visual streams, a setting in which current Omni-LLMs still struggle. The authors introduce MOV-Bench, a benchmark of 519 curated questions targeting this setting, and propose AOP-Agent, an agentic framework that gives open-source Omni-LLMs an active omni-modal perception mechanism. The framework requires no additional training and no proprietary models. It combines a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop and consistently improves reasoning performance on MOV-Bench and OmniVideoBench, with the most notable gains on long videos and reasoning-intensive questions.

📝 Abstract

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.
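The abstract names two ingredients, a hierarchical memory and an observe-reflect-replan loop, without giving implementation details. The toy sketch below only illustrates how such a loop might be wired together; it is not the authors' implementation. Every name in it (`Memory`, `run_agent`, the `observe`, `reflect`, and `replan` callbacks, and the canned clip data) is a hypothetical stand-in, and the callbacks replace what would presumably be Omni-LLM calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class Memory:
    """Hypothetical two-level store: coarse tags per segment, fine notes in visit order."""
    coarse: Dict[Segment, str] = field(default_factory=dict)
    fine: List[Tuple[Segment, str]] = field(default_factory=list)


def run_agent(
    question: str,
    segments: List[Segment],
    observe: Callable[[str, Segment], str],
    reflect: Callable[[str, Memory], Optional[str]],
    replan: Callable[[str, Memory, List[Segment]], List[Segment]],
    max_steps: int = 8,
) -> Tuple[Optional[str], Memory]:
    """Assumed control flow: observe one segment, reflect on memory, replan the rest."""
    memory = Memory()
    plan = list(segments)  # initial plan: visit segments in the given order
    for _ in range(max_steps):
        if not plan:
            break
        seg = plan.pop(0)
        note = observe(question, seg)                 # observe: inspect one segment
        memory.coarse[seg] = note.split(":", 1)[0]    # coarse level: short tag
        memory.fine.append((seg, note))               # fine level: full note
        answer = reflect(question, memory)            # reflect: is evidence sufficient?
        if answer is not None:
            return answer, memory
        plan = replan(question, memory, plan)         # replan: reorder what remains
    return None, memory


def demo() -> Tuple[Optional[str], Memory]:
    # Canned observations standing in for model outputs (illustrative data only).
    clips = {
        (0.0, 10.0): "visual: a red door",
        (10.0, 20.0): "nothing relevant",
        (20.0, 30.0): "audio: a bell rings",
    }

    def observe(q: str, seg: Segment) -> str:
        return clips[seg]

    def reflect(q: str, mem: Memory) -> Optional[str]:
        notes = [n for _, n in mem.fine]
        has_visual = any(n.startswith("visual:") for n in notes)
        has_audio = any(n.startswith("audio:") for n in notes)
        if has_visual and has_audio:
            return "bell rings after the red door appears"
        return None

    def replan(q: str, mem: Memory, remaining: List[Segment]) -> List[Segment]:
        # Toy heuristic: try the latest unvisited segment first.
        return sorted(remaining, key=lambda s: -s[0])

    return run_agent("What sound follows the red door?", list(clips), observe, reflect, replan)
```

In this toy run the loop answers after visiting two of the three segments, because the replanning step reorders the remaining segments instead of scanning them in sequence. A real agent would choose segments from model-generated reflections, which the abstract does not specify.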

Problem

Research questions and friction points this paper is trying to address.

multi-hop reasoning

audio-visual perception

Omni-LLMs

temporal dispersion

cross-modal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic active perception

multi-hop audio-visual reasoning

omni-modal memory