LanteRn: Latent Visual Structured Reasoning

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak performance of existing large multimodal models on fine-grained spatial and visual reasoning tasks, which typically stems from converting visual inputs into textual tokens before reasoning. To overcome this limitation, the authors propose LanteRn, a framework that generates and attends to continuous latent visual thought embeddings within the model itself, enabling reasoning that alternates between linguistic and visual representations directly in latent space, without depending on external modules or computationally expensive pixel-level operations. Built on a vision-language Transformer and trained with a two-stage strategy combining supervised fine-tuning and reinforcement learning, LanteRn aligns its latent reasoning process with downstream task objectives. The approach yields consistent gains across three perception-intensive benchmarks (VisCoT, V*, and Blink), supporting the efficacy and generality of the proposed latent visual reasoning mechanism.
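To make the alternating language/latent mechanism concrete, here is a minimal decoding sketch. This is not the authors' code: the model hooks (`forward_hidden`, `lm_head`, `latent_head`, `embed`), the special `<latent>` control token id, and the EOS id are all hypothetical names standing in for whatever the actual architecture exposes. The idea it illustrates is the one the summary describes: when the model signals a visual thought step, a continuous embedding is appended to the context directly, bypassing the text vocabulary.

```python
# A minimal sketch (assumptions, not the paper's implementation) of
# interleaved latent decoding with a decoder-only vision-language model.
import torch

LATENT_TOKEN_ID = 32001  # hypothetical id of a special <latent> control token
EOS_ID = 2               # hypothetical end-of-sequence id

@torch.no_grad()
def interleaved_decode(model, input_embeds, max_steps=128):
    """Alternate between emitting discrete text tokens and continuous
    latent visual embeddings, feeding both back into the context."""
    embeds = input_embeds          # (1, seq_len, d_model): image + prompt
    out_tokens = []
    for _ in range(max_steps):
        hidden = model.forward_hidden(embeds)[:, -1]       # (1, d_model)
        logits = model.lm_head(hidden)                     # (1, vocab)
        next_id = logits.argmax(dim=-1)                    # greedy for brevity
        if next_id.item() == EOS_ID:
            break
        if next_id.item() == LATENT_TOKEN_ID:
            # Visual thought step: project the hidden state to a continuous
            # latent embedding and append it directly, skipping the vocab.
            step = model.latent_head(hidden).unsqueeze(1)  # (1, 1, d_model)
        else:
            out_tokens.append(next_id.item())
            step = model.embed(next_id).unsqueeze(1)       # (1, 1, d_model)
        embeds = torch.cat([embeds, step], dim=1)
    return out_tokens
```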

📝 Abstract
While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a severe limitation for tasks requiring fine-grained spatial and visual understanding. Recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, but they either rely on external modules or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.
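The abstract's two-stage recipe can be sketched as follows. This is a hedged illustration, not the paper's released code: the grounding target (`target_feats`, e.g. encoder features of a relevant image region), the loss weight `alpha`, the `sample_trajectory` API, and the use of plain REINFORCE as the RL algorithm are all assumptions chosen for brevity.

```python
# A minimal sketch (assumptions, not the paper's implementation) of the
# two-stage training recipe: (1) SFT grounds generated latent states to
# reference visual features with an auxiliary regression term next to the
# usual token cross-entropy; (2) RL scores full rollouts with a task reward.
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, latent_pred, target_feats, alpha=0.5):
    """Stage 1: cross-entropy on text tokens plus an MSE term pulling the
    generated latent visual embeddings toward reference visual features."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    mse = F.mse_loss(latent_pred, target_feats)
    return ce + alpha * mse

def rl_step(model, optimizer, batch, reward_fn):
    """Stage 2: sample a full interleaved trajectory, score it with a
    task-level reward, and reinforce the sampled tokens (REINFORCE)."""
    log_probs, answer = model.sample_trajectory(batch)  # hypothetical API
    reward = reward_fn(answer, batch["gold"])           # e.g., exact match
    loss = -(reward * log_probs.sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

The auxiliary MSE term gives the continuous latent states a supervised anchor before RL, so the reward stage only has to align an already-grounded latent reasoning process with task utility rather than learn the latent space from scratch.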
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
multimodal models
latent representations
fine-grained spatial understanding
vision-language reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent visual reasoning
multimodal reasoning
visual thought embeddings
vision-language transformer
reinforcement learning