LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

📅 2026-01-15
📈 Citations: 2
Influential: 0
🤖 AI Summary
In multimodal implicit reasoning, lightweight student models often rely excessively on linguistic priors while neglecting genuine visual perception, so their visual attention diverges significantly from that of their teachers. To address this, the paper proposes a paradigm that aligns the "latent visual thinking" of student and teacher models: the student autoregressively reconstructs the teacher's visual semantics and attention trajectories prior to text generation, aligning their dynamic visual reasoning processes, while a curriculum-based sensory gating mechanism suppresses shortcut learning. This is the first approach to explicitly model and transfer a teacher's dynamic visual attention; it achieves up to a 16.9% gain on complex reasoning tasks and enables a 3B-parameter model to surpass both larger open-source models and closed-source systems such as GPT-4o.

📝 Abstract
Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
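The abstract describes aligning the student's attention trajectory to the teacher's under a curriculum gate, but does not give the training objective. As a rough illustration only (not the paper's actual loss; all names here are hypothetical), an attention-trajectory alignment term could be a per-step KL divergence between teacher and student visual-attention distributions, scaled by a gating coefficient that a curriculum schedule would anneal:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention logits.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_alignment_loss(student_logits, teacher_logits, gate=1.0):
    """Hypothetical sketch: KL(teacher || student) over visual-attention
    distributions, averaged across reasoning steps. `gate` is a stand-in
    for the paper's curriculum sensory gating coefficient.

    student_logits, teacher_logits: arrays of shape (steps, num_patches).
    """
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    # Per-step KL divergence, summed over image patches.
    kl = (t * (np.log(t + 1e-9) - np.log(s + 1e-9))).sum(axis=-1)
    return gate * kl.mean()
```

With identical logits the loss is zero; as the student's attention drifts from the teacher's, the loss grows, and setting `gate=0.0` switches the visual supervision off, mimicking how a gating schedule could modulate the signal during training.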
Problem

Research questions and friction points this paper is trying to address.

Perception Gap, multimodal reasoning, visual grounding, knowledge distillation, latent reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent visual alignment, visual grounding, multimodal reasoning, attention trajectory, perception gap
Authors

Linquan Wu (City University of Hong Kong)
Tianxiang Jiang (University of Science and Technology of China)
Yifei Dong (KTH Royal Institute of Technology; Robotic Manipulation)
Haoyu Yang (University of Electronic Science and Technology of China)
Fengji Zhang (Department of Computer Science, City University of Hong Kong; Software Engineering, Large Language Models)
Shichang Meng (City University of Hong Kong)
Ai Xuan (City University of Hong Kong)
Linqi Song (Associate Professor, Department of Computer Science, City University of Hong Kong; Information Theory, Federated Learning, Natural Language Processing)
J. Keung (City University of Hong Kong)