Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

📅 2026-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation that existing visual latent reasoning methods suffer as their latent sequences grow longer. The authors trace the degradation to two causes: information gain collapse, in which autoregressive generation makes each step so dependent on prior outputs that later tokens add little new information, and weak supervision from heavily pooled ($\geq 128\times$) image embeddings. To address both, they propose SCOLAR (Self-COnsistent LAtent Reasoning), which employs a lightweight detransformer to generate auxiliary visual tokens in a single pass from the LLM's full-sequence hidden states, with each token independently anchored to the original visual space. Combined with three-stage supervised fine-tuning (SFT) and ALPO reinforcement learning, SCOLAR extends the acceptable latent chain-of-thought length by over $30\times$, achieves state-of-the-art results among open-source models on real-world reasoning benchmarks (+14.12% over the backbone), and shows strong out-of-distribution generalization.
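The single-pass idea in the summary can be illustrated with a minimal sketch. It is not the authors' detransformer: the use of one cross-attention step, the learned query matrix `queries`, the function name `single_shot_decode`, and all shapes are assumptions made for illustration. It only shows the property described above, that each auxiliary token is computed from the full sequence of hidden states and does not depend on the other generated tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_shot_decode(hidden_states, queries):
    """Hypothetical single-pass decoder (illustrative assumption only).

    hidden_states: (T, d) full-sequence hidden states.
    queries:       (K, d) one query per auxiliary token.
    Returns (K, d): every row is produced in the same pass and depends
    only on its own query and on hidden_states, not on other rows.
    """
    d = hidden_states.shape[-1]
    scores = queries @ hidden_states.T / np.sqrt(d)   # (K, T)
    weights = softmax(scores, axis=-1)                # attention over the sequence
    return weights @ hidden_states                    # (K, d)

rng = np.random.default_rng(0)
H = rng.normal(size=(12, 16))   # assumed: 12 positions, hidden width 16
Q = rng.normal(size=(4, 16))    # assumed: 4 auxiliary tokens
tokens = single_shot_decode(H, Q)
```

In an autoregressive decoder, by contrast, token k would be conditioned on tokens 1 to k-1, which is the dependency the summary identifies as the source of information gain collapse.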
📝 Abstract
In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We trace this to Information Gain Collapse: autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends the acceptable latent CoT length by over $30\times$, achieves state-of-the-art results among open-source models on real-world reasoning benchmarks (+14.12% over the backbone), and demonstrates strong out-of-distribution generalization.
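To give a sense of scale for the $128\times$ pooling mentioned in the abstract, here is a minimal sketch. The abstract does not say how the pooling is done, so average pooling over consecutive tokens, the helper name `avg_pool_tokens`, and the sizes (1024 patch embeddings of width 64) are all assumptions chosen for illustration.

```python
import numpy as np

def avg_pool_tokens(embeddings, factor):
    """Average-pool a token sequence by `factor` (assumed pooling scheme).

    embeddings: (N, d) patch embeddings, N divisible by factor.
    Returns (N // factor, d).
    """
    n, d = embeddings.shape
    if n % factor != 0:
        raise ValueError("token count must be divisible by the pooling factor")
    # Group consecutive tokens and average within each group.
    return embeddings.reshape(n // factor, factor, d).mean(axis=1)

rng = np.random.default_rng(0)
patches = rng.normal(size=(1024, 64))   # assumed: 1024 patch embeddings
pooled = avg_pool_tokens(patches, 128)  # 128x pooling
print(patches.shape, "->", pooled.shape)
```

Under these assumed sizes, 1024 patch embeddings collapse to 8 vectors, which shows how little spatial detail a target pooled this heavily can retain.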
Problem

Research questions and friction points this paper is trying to address.

latent reasoning
vision-language model
information gain collapse
long latent sequence
visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Consistent Latent Reasoning
Information Gain Collapse
Detransformer
Latent Chain-of-Thought
Vision-Language Reasoning