Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

📅 2026-04-27

🤖 AI Summary

Existing approaches to audio-visual reasoning generate reasoning trajectories in isolation, either step by step or sample by sample, so promising intermediate paths cannot be shared. This makes exploration of the large cross-modal interaction space inefficient and lets errors compound. This work proposes Omni-o3, a framework that formulates reasoning as a dynamic recursive search in which branches share reasoning prefixes and iteratively execute four atomic cognitive actions: expansion, selection, simulation, and backpropagation. The framework is driven by a deep nested deduction policy and trained in two stages: first, cold-start supervised fine-tuning on 101K high-quality, long-chain reasoning trajectories distilled from 3.5M omnimodal samples to establish recursive search patterns; second, exploratory reinforcement learning with nested group rollouts on 18K complex multi-turn samples, guided by a multi-step reward model, to elicit deeper nested reasoning. Experiments show that Omni-o3 achieves competitive performance across 11 benchmarks covering audio-visual, visual-centric, and audio-centric reasoning tasks.

📝 Abstract

Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
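The four actions named in the abstract (expansion, selection, simulation, backpropagation) are also the steps of a classic Monte Carlo tree search loop, so a generic tree search may help make the idea of shared reasoning prefixes concrete. The sketch below is an illustration under stated assumptions, not Omni-o3's actual policy: `Node`, `ucb`, `toy_propose`, `toy_simulate`, the UCB selection rule, and the toy reward are all hypothetical stand-ins, since the abstract does not specify how each action is implemented. Every branch under a node reuses that node's prefix instead of regenerating it, which is the property the abstract contrasts with isolated trajectories.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    """One reasoning step; the path from the root is the shared prefix."""
    step: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # sum of rewards backpropagated through this node

    def prefix(self) -> List[str]:
        steps, node = [], self
        while node is not None:
            steps.append(node.step)
            node = node.parent
        return steps[::-1]


def ucb(node: Node, c: float = 1.4) -> float:
    """UCB1 score (an assumed selection rule, not taken from the paper)."""
    if node.visits == 0:
        return math.inf
    explore = math.sqrt(math.log(node.parent.visits) / node.visits)
    return node.value / node.visits + c * explore


def select(root: Node) -> Node:
    """Selection: follow the best-scoring child down to a leaf."""
    node = root
    while node.children:
        node = max(node.children, key=ucb)
    return node


def expand(node: Node, propose: Callable[[List[str]], List[str]]) -> Node:
    """Expansion: attach candidate next steps; all of them share node.prefix()."""
    for step in propose(node.prefix()):
        node.children.append(Node(step, parent=node))
    return node.children[0] if node.children else node


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Backpropagation: credit every node on the shared prefix."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def search(root: Node, propose, simulate, iterations: int = 200) -> Node:
    for _ in range(iterations):
        leaf = select(root)
        child = expand(leaf, propose)
        reward = simulate(child.prefix())  # Simulation: score the partial path
        backpropagate(child, reward)
    return max(root.children, key=lambda n: n.visits)


MAX_DEPTH = 3  # hypothetical limit on reasoning steps


def toy_propose(prefix: List[str]) -> List[str]:
    # Stand-in for a model proposing next steps; stops at MAX_DEPTH.
    return [] if len(prefix) > MAX_DEPTH else ["audio", "video"]


def toy_simulate(prefix: List[str]) -> float:
    # Stand-in for a reward model: pretend audio cues solve this toy task.
    return prefix[1:].count("audio") / MAX_DEPTH


root = Node("question")
best = search(root, toy_propose, toy_simulate)
print(best.step, best.visits)
```

In this toy setup the reward flows back along the shared prefix, so evidence gathered in one branch immediately changes which sibling branches get explored next. A sample-by-sample rollout scheme would instead score each full trajectory independently.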

Problem

Research questions and friction points this paper is trying to address.

omnimodal reasoning

audio-visual reasoning

reasoning trajectories

cross-modal interactions

exploration efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

deep nested deduction

recursive search

omnimodal reasoning