LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

📅 2026-05-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multimodal large language models in fine-grained audio-visual joint reasoning, which often rely on explicit textual chains of thought that compress continuous signals, disrupt temporal alignment, and suffer from linguistic priors. To overcome these issues, the authors propose LatentOmni, a framework that interleaves textual reasoning with perceptual state modeling within a unified audio-visual latent space, enabling tight cross-modal joint reasoning. Key innovations include feature-level supervised alignment between task-relevant perceptual features and reasoning states, Omni-Sync positional encoding to preserve temporal consistency, and the introduction of LatentOmni-Instruct-35K, the first dataset for interleaved audio-visual latent reasoning. Experiments demonstrate that the proposed method significantly outperforms existing open-source models and explicit chain-of-thought baselines across multiple audio-visual reasoning benchmarks, validating the efficacy of latent-space joint reasoning.
πŸ“ Abstract
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
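To make the idea of feature-level supervision concrete, here is a minimal sketch of one way a latent reasoning state could be pulled toward a task-relevant sensory feature. The abstract does not state which loss LatentOmni uses, so the cosine-distance objective and the name `cosine_alignment_loss` are assumptions chosen for illustration only, not the authors' method.

```python
import math


def cosine_alignment_loss(latent_states, target_features):
    """Mean (1 - cosine similarity) over paired vectors.

    Hypothetical helper: cosine distance is a common choice for
    feature alignment, assumed here because the source does not
    specify the actual objective.
    """
    if not latent_states or len(latent_states) != len(target_features):
        raise ValueError("need one target feature per latent state")
    total = 0.0
    for z, f in zip(latent_states, target_features):
        dot = sum(a * b for a, b in zip(z, f))
        norm_z = math.sqrt(sum(a * a for a in z))
        norm_f = math.sqrt(sum(b * b for b in f))
        # Small epsilon guards against division by zero for null vectors.
        total += 1.0 - dot / (norm_z * norm_f + 1e-8)
    return total / len(latent_states)
```

Under this assumption, a latent state that points in the same direction as its target feature yields a loss near 0, while an orthogonal one yields a loss of 1.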
Problem

Research questions and friction points this paper is trying to address.

omnimodal understanding
audio-visual reasoning
multimodal large language models
latent space
temporal grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent-space reasoning
audio-visual fusion
Omni-Sync Position Embedding
multimodal large language models
feature-level supervision
Yifan Dai
Hunan University
LLM, Agent, AI4Science
Zhenhua Wu
Kling Team, Kuaishou Technology
Bohan Zeng
PhD student, Peking University
Data-Centric AI, Computer Vision, Diffusion Model, 3D
Daili Hua
Peking University
Jialing Liu
University of California, San Francisco
Stroke, Brain Injury, Functional Recovery, Neurogenesis, Vascular Remodeling
Bozhou Li
Peking University; Kling Team, Kuaishou Technology
Yuran Wang
Peking University
Embodied AI, Computer Vision
Chengzhuo Tong
Peking University; Kling Team, Kuaishou Technology
Hao Liang
Peking University
Data Centric Machine Learning, Large Language Models, Multimodal Large Language Models
Xiaochen Ma
HKUST
Junbo Niu
Peking University
Foundation Model
Tianyu Guo
Peking University
Yang Shi
Peking University
Multimodal Learning, Causal Inference, Reinforcement Learning
Yue Ding
CASIA; Kling Team, Kuaishou Technology
Yiyan Ji
Nanjing University; Kling Team, Kuaishou Technology
Bingyin Mei
Tsinghua University
Yushuo Guan
Peking University
VLM, Diffusion Model
Yuanxing Zhang
Kuaishou Technology
Recommender System, Large Language Model, Video Understanding
Pengfei Wan
Head of Kling Video Generation Models, Kuaishou Technology
Generative Models, Computer Vision, Multimodal AI, Computer Graphics
Fangcheng Fu
Shanghai Jiao Tong University
machine learning, deep learning, MLSys, distributed computation
Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, htsc, time-resolved