AI Summary
This work addresses a limitation of existing multimodal large language models in fine-grained joint audio-visual reasoning: they often rely on explicit textual chains of thought, which compress continuous signals into discrete tokens, weaken temporal alignment, and bias intermediate reasoning toward linguistic priors. To overcome these issues, the authors propose LatentOmni, a framework that interleaves textual reasoning with perceptual state modeling within a unified audio-visual latent space, enabling tight cross-modal joint reasoning. Key innovations include feature-level supervision that aligns latent reasoning states with task-relevant perceptual features, Omni-Sync Position Embedding (OSPE) to preserve temporal consistency between audio and visual states, and LatentOmni-Instruct-35K, the first dataset for interleaved audio-visual latent reasoning. Experiments show that the proposed method achieves the best performance among the evaluated open-source models and consistently outperforms an explicit text chain-of-thought baseline across multiple audio-visual reasoning benchmarks, supporting the efficacy of latent-space joint reasoning.
Abstract
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.