🤖 AI Summary
This work addresses the performance degradation of multimodal large language models on long-context, multi-step reasoning tasks, which often stems from visual information dilution. To mitigate this issue, the authors propose the Vision-aligned Latent Reasoning (VaLR) framework, which dynamically generates latent tokens aligned with the visual encoder before each chain-of-thought reasoning step, thereby continuously injecting perceptual cues. VaLR is the first method to enable dynamic injection of vision-aligned representations during reasoning, effectively alleviating information dilution and demonstrating strong test-time scaling capabilities. On benchmarks such as VSI-Bench, which emphasize long-context understanding and fine-grained visual perception, VaLR improves accuracy from 33.0% to 52.9%, a 19.9-percentage-point gain over Qwen2.5-VL.
📝 Abstract
Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems that require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain-of-Thought reasoning step, guiding the model to reason from perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning the MLLM's intermediate embeddings with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR significantly improves performance on VSI-Bench from 33.0% to 52.9%, a 19.9 percentage-point gain over Qwen2.5-VL.
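The abstract's core idea, keeping the MLLM's intermediate embeddings aligned with the vision encoder's features, can be illustrated with a minimal sketch. The function below is a hypothetical alignment objective (the paper's exact loss is not specified here): it scores how far the hidden states at the inserted latent-token positions have drifted from the frozen vision-encoder features, using mean cosine distance. All names and the choice of cosine distance are illustrative assumptions.

```python
import numpy as np

def alignment_loss(mllm_hidden, vision_feats):
    """Hypothetical vision-alignment objective: mean (1 - cosine similarity).

    mllm_hidden:  (num_latent_tokens, d) MLLM hidden states at the latent-token
                  positions generated before a reasoning step.
    vision_feats: (num_latent_tokens, d) target features from the frozen
                  vision encoder.
    Returns 0.0 when the two sets of embeddings are perfectly aligned.
    """
    h = mllm_hidden / np.linalg.norm(mllm_hidden, axis=-1, keepdims=True)
    v = vision_feats / np.linalg.norm(vision_feats, axis=-1, keepdims=True)
    cos = np.sum(h * v, axis=-1)       # per-token cosine similarity
    return float(np.mean(1.0 - cos))   # averaged over latent tokens

# Identical embeddings give zero loss; orthogonal ones give loss 1.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
print(alignment_loss(x, x))             # → 0.0
print(alignment_loss(x, x[::-1].copy()))  # → 1.0
```

During training, minimizing such a term alongside the usual language-modeling loss would push the latent tokens to stay close to the perceptual representation, which is one plausible way to counteract the visual-information dilution the abstract describes.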