AI Summary
This work addresses the limitations of current vision-language models in multi-step reasoning, which suffer from a visual information bandwidth bottleneck and premature semantic collapse caused by discrete textualization. To overcome these challenges, the authors propose the Laser paradigm, which establishes a "forest-before-trees" cognitive hierarchy through Dynamic Windowed Alignment Learning (DWAL), latent-state hyper-positional representations, and a self-refinement stacking mechanism. This approach maintains a probabilistic superposition of global features while progressively attending to local details. Evaluated across six benchmarks, Laser outperforms the Monet baseline by an average of 5.03%, reduces reasoning tokens by over 97%, and demonstrates exceptional out-of-distribution generalization, achieving a strong balance among reasoning efficiency, stability, and interpretability.
Abstract
While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck: continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest-before-Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser preserves interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on six benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.
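The abstract's core idea, aligning a latent state with a *window* of admissible future semantics rather than forcing a point-wise match to a single next target, can be sketched as a soft-minimum loss over the window. This is a minimal conceptual illustration only, not the authors' implementation: the function name `windowed_alignment_loss`, the cosine-distance choice, the soft-min temperature `tau`, and the fixed `window` size are all assumptions made for the sketch.

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors (0 = identical direction)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

def windowed_alignment_loss(latent, future_embs, window=3, tau=0.1):
    """Align `latent` with the best-matching embedding inside a validity
    window of upcoming semantics, via a differentiable soft minimum.

    Hypothetical sketch of a DWAL-style objective: the latent state is
    rewarded for being close to *any* of the next `window` targets,
    letting it keep a superposition of plausible futures instead of
    collapsing onto one point-wise prediction.
    """
    dists = np.array([cosine_dist(latent, e) for e in future_embs[:window]])
    # Soft-min weights: the closest target dominates as tau -> 0.
    weights = np.exp(-dists / tau)
    weights /= weights.sum()
    return float(weights @ dists)
```

With a small `tau`, the loss approaches the distance to the nearest target in the window, so a latent state that matches the second or third upcoming semantic is penalized far less than under a strict next-token objective.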