Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

📅 2026-05-07

🤖 AI Summary
This work addresses a limitation of current vision-language models in fine-grained perception: compressing visual evidence into discrete text creates an information bottleneck, while existing latent (implicit) reasoning approaches suffer from insufficient manifold compatibility, with trajectories that drift from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. The authors propose RIS (Retrieve, Integrate, and Synthesize), a framework for continuous latent reasoning grounded in both spatial and semantic evidence. They construct a step-wise reasoning dataset with bounding boxes and region-specific descriptions, then use a progressive attention bottleneck and short language transition tokens to anchor latent states to that evidence and bridge them back to vocabulary-aligned decoding. On V*, HRBench4K, HRBench8K, MMVP, and BLINK, RIS shows consistent improvements over closed-source, open-source, and latent reasoning baselines, and further analyses indicate that its latent trajectories are diverse, interpretable, and progressively integrated.
📝 Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.
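The abstract does not say how the progressive attention bottleneck is implemented. As a rough illustration of the general idea only, the sketch below builds a causal attention mask in which answer tokens lose direct access to a growing share of image tokens, so that visual evidence has to reach the answer through the latent tokens. The token layout (image, question, latent, answer), the function name `bottleneck_mask`, and the `block_ratio` schedule are all assumptions made for this example, not details taken from the paper.

```python
def bottleneck_mask(n_img, n_txt, n_lat, n_ans, block_ratio):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumed (hypothetical) token layout: [image | question | latent | answer].
    block_ratio in [0, 1] is the share of image tokens hidden from answer tokens.
    """
    if not 0.0 <= block_ratio <= 1.0:
        raise ValueError("block_ratio must be in [0, 1]")

    total = n_img + n_txt + n_lat + n_ans
    ans_start = n_img + n_txt + n_lat
    n_blocked = round(block_ratio * n_img)

    # Start from a standard causal mask: each position sees itself and the past.
    mask = [[k <= q for k in range(total)] for q in range(total)]

    # Hide the first n_blocked image positions from every answer token.
    # Latent tokens keep full access to the image, so they become the only
    # route by which the hidden visual evidence can reach the answer.
    for q in range(ans_start, total):
        for k in range(n_blocked):
            mask[q][k] = False
    return mask


if __name__ == "__main__":
    # Tighten the bottleneck in three stages and report how many image
    # tokens the last answer token can still attend to directly.
    for ratio in (0.0, 0.5, 1.0):
        m = bottleneck_mask(n_img=4, n_txt=2, n_lat=3, n_ans=2, block_ratio=ratio)
        print(ratio, sum(m[-1][:4]))
```

Raising `block_ratio` gradually over training is one plausible reading of "progressive"; the paper may instead schedule the bottleneck by layer, by reasoning step, or in some other way.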
Problem

Research questions and friction points this paper is trying to address.

latent visual reasoning
manifold compatibility
multimodal large language models
fine-grained perception
vision-language reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent visual reasoning
spatial-semantic grounding
multimodal large language models
progressive attention bottleneck
interpretable reasoning trajectories
Jin Cui
Principal Engineer
Embedded System, OS Kernel & Driver, Hypervisor & Virtualization, Computer uArch modelling, FPGA & EDA
Xinyue Long
School of Software Engineering, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Xunyong Zhang
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Yadong Zhang
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Chuanchang Su
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jingye Gan
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Boran Zhao
School of Software Engineering, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Pengju Ren
Professor, Xi'an Jiaotong University