CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited exploratory capacity of existing latent visual reasoning approaches, which rely on rigid alignment objectives that force latent representations to match predefined visual features. To overcome this constraint, the authors propose a latent contrastive training framework that enhances exploration in the latent space of multimodal large language models (MLLMs). The method learns diverse latent representations through angle-based perturbations and introduces a latent trajectory contrastive reward to guide reinforcement learning post-training. By combining continuous hidden-state propagation with contrastive optimization, the approach yields average gains of 5.83% on VSP and 8.00% on Jigsaw, and outperforms existing latent models by 3.40% on the out-of-domain MMStar benchmark. These results indicate improved diversity and generalization capability in visual reasoning.

📝 Abstract

Because Latent Visual Reasoning has the potential for exploratory reasoning, recent works enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives that force latent representations to match predefined visual features, which severely limits the exploratory capacity of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning), which introduces a latent contrastive training framework to make visual reasoning more exploratory. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process, thereby fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, code, and models are released at https://github.com/Oscar-dzy/CoLVR.
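
To make the first stage more concrete, here is a minimal sketch of one way an angle-based perturbation could feed a contrastive objective. It is not the authors' implementation: the abstract does not specify the loss, so the rotation scheme, the InfoNCE loss, the 15-degree angle, the temperature, and every function name below are assumptions chosen for illustration. The idea shown is that a latent vector rotated by a small angle serves as a positive sample, while unrelated latents serve as negatives.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def angular_perturb(z, theta, rng):
    """Rotate z by `theta` radians toward a random orthogonal direction.

    Hypothetical helper: the result is a unit vector whose cosine
    similarity with z equals cos(theta). Requires len(z) >= 2.
    """
    z_hat = normalize(z)
    noise = [rng.gauss(0.0, 1.0) for _ in z]
    # Gram-Schmidt step: drop the component of the noise along z_hat.
    proj = dot(noise, z_hat)
    u = normalize([n - proj * zi for n, zi in zip(noise, z_hat)])
    return [math.cos(theta) * zi + math.sin(theta) * ui
            for zi, ui in zip(z_hat, u)]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss over cosine similarities (assumed objective)."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(16)]          # stand-in latent state
z_pos = angular_perturb(z, math.radians(15), rng)     # perturbed positive
negatives = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
loss = info_nce(z, z_pos, negatives)
print(f"contrastive loss: {loss:.4f}")
```

In this sketch the angle controls how far a positive may drift from the original latent. A larger angle tolerates more variation, which is one plausible reading of how a perturbation-guided objective could avoid over-constrained embeddings. In a real MLLM the vectors would be hidden states and the loss would be computed with an autodiff framework.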

Problem

Research questions and friction points this paper is trying to address.

Latent Visual Reasoning

Exploratory Reasoning

Contrastive Optimization

Multimodal Large Language Models

Latent Representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Visual Reasoning

Contrastive Optimization

Multimodal Large Language Models