🤖 AI Summary
This work addresses the performance degradation that existing visual latent reasoning methods suffer as latent sequences grow longer. The authors attribute the degradation to two causes: Information Gain Collapse, in which autoregressive generation makes each step so dependent on prior outputs that later tokens add little new information, and weak supervision from heavily pooled (≥128×) image embeddings, which carry no more signal than meaningless placeholders. To address both, they propose SCOLAR (Self-COnsistent LAtent Reasoning), which uses a lightweight detransformer to generate auxiliary visual tokens in a single shot from the LLM's full-sequence hidden states, with each token independently anchored to the original visual space. Combined with three-stage supervised fine-tuning (SFT) and ALPO reinforcement learning, SCOLAR extends the acceptable latent chain-of-thought length by over 30×, achieves state-of-the-art results among open-source models on real-world reasoning benchmarks (+14.12% over the backbone), and shows strong out-of-distribution generalization.
📝 Abstract
In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We trace the root cause to Information Gain Collapse: autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further find that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends the acceptable latent CoT length by over $30\times$, achieves state-of-the-art results among open-source models on real-world reasoning benchmarks (+14.12% over the backbone), and demonstrates strong out-of-distribution generalization.
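To give a sense of scale for the $\geq 128\times$ pooling mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes non-overlapping average pooling over a flat list of patch embeddings; the function name `average_pool` and the sizes (1024 patches, embedding dimension 4) are hypothetical and chosen only for illustration.

```python
def average_pool(tokens, factor):
    """Average-pool embedding vectors in non-overlapping groups of `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        dim = len(group[0])
        # Element-wise mean over every vector in the group.
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


# Hypothetical sizes (assumption, not from the paper): 1024 patches, dimension 4.
num_patches, dim = 1024, 4
patches = [[float((i * 7 + d) % 13) for d in range(dim)] for i in range(num_patches)]

# 128x pooling leaves one target vector per 128 patch embeddings.
targets = average_pool(patches, 128)
print(len(patches), "patch embeddings ->", len(targets), "pooled targets")
```

Under these assumed sizes, each pooled target is the mean of 128 distinct patch embeddings, so any detail that varies within a group is averaged away before it can serve as supervision.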