Monet: Reasoning in Latent Visual Space Beyond Images and Language

πŸ“… 2025-11-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing visual reasoning methods rely on external tools, which limits their ability to perform abstract, flexible reasoning directly in latent visual space. Method: this paper introduces a paradigm for multi-step visual reasoning carried out entirely in the latent space: continuous visual embeddings serve as intermediate β€œthought” representations, removing the need for conventional tool-calling mechanisms. Training combines a three-stage distillation-based supervised fine-tuning procedure with VLPO (Visual-Latent Policy Optimization), a reinforcement learning algorithm that, for the first time, explicitly incorporates latent representations into policy updates. The resulting Monet-7B model is trained on 125K interleaved image-text chain-of-thought examples. Contribution/Results: experiments show significant improvements over state-of-the-art methods on both real-world perception and abstract visual reasoning benchmarks, with strong cross-distribution generalization.
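The key idea, reasoning with continuous embeddings instead of tool calls, can be sketched as a toy loop. This is a hypothetical illustration (the `toy_step` policy and placeholder embeddings are assumptions, not the authors' model): the policy alternates between emitting discrete text tokens and continuous latent vectors, and the latent vectors re-enter the context directly rather than being rendered by an external tool.

```python
# Hypothetical sketch of latent visual reasoning, not the authors' code.
# A toy "policy" interleaves text tokens with continuous latent embeddings;
# the embeddings are fed back into the context as visual "thoughts".

def toy_step(context):
    """Toy policy: emit a latent embedding on even steps, a token otherwise."""
    if len(context) % 2 == 0:
        # Continuous visual "thought": a vector, not a discrete token.
        return [0.1 * len(context)] * 4       # placeholder 4-dim embedding
    return f"<tok{len(context)}>"             # discrete text token

def latent_reasoning_loop(prompt, max_steps=6):
    context = list(prompt)
    trace = []
    for _ in range(max_steps):
        out = toy_step(context)
        trace.append(out)
        context.append(out)  # latent thoughts re-enter the context directly
    return trace

trace = latent_reasoning_loop(["<img>", "<q>"])
# The trace interleaves embeddings (lists) with tokens (strings).
```

The point of the sketch is the interface: because intermediate visual evidence is an embedding consumed by the same model, no image needs to be generated or processed by an external tool between reasoning steps.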

πŸ“ Abstract
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.
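The abstract notes that a core challenge is insufficient supervision over latent embeddings, addressed via distillation-based SFT. One plausible way to supervise a latent thought is to pull it toward a teacher-provided visual embedding; the sketch below uses an MSE objective purely for illustration (the loss choice and update rule are assumptions, not the paper's exact three-stage pipeline).

```python
# Hedged sketch of distillation-style supervision on latent embeddings.
# The student's predicted latent is nudged toward a teacher visual embedding
# via a hand-derived MSE gradient step (illustrative, pure-Python).

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_step(student_latent, teacher_embedding, lr=0.5):
    """One gradient step on mean((s - t)^2) w.r.t. the student latent.

    d/ds_i of mean((s - t)^2) = 2 * (s_i - t_i) / n
    """
    n = len(student_latent)
    return [s - lr * 2 * (s - t) / n
            for s, t in zip(student_latent, teacher_embedding)]

student = [0.0, 0.0]
teacher = [1.0, -1.0]
before = mse(student, teacher)
student = distill_step(student, teacher)
after = mse(student, teacher)   # loss decreases after the step
```

This kind of dense target on the embedding itself is what distinguishes distillation supervision from purely text-token cross-entropy, where the latent would receive no direct learning signal.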
Problem

Research questions and friction points this paper is trying to address.

Enabling multimodal models to reason directly in latent visual space
Overcoming computational cost and supervision challenges in latent-vision alignment
Developing reinforcement learning for visual reasoning beyond text-based methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates continuous embeddings as intermediate visual thoughts
Uses three-stage distillation-based supervised fine-tuning pipeline
Proposes VLPO reinforcement learning for latent reasoning optimization
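The VLPO bullet above can be made concrete with a toy REINFORCE-style update. The sketch is illustrative (the Gaussian parameterization, names, and constants are assumptions, not the paper's exact algorithm): if a sampled continuous latent is treated as a Gaussian "action", its log-probability contributes a gradient term, whereas an update computed only from text-token log-probs, as in vanilla GRPO, would leave the latent pathway untouched.

```python
import math

# Toy sketch of the VLPO idea: include the sampled latent's Gaussian
# log-probability in the policy gradient (illustrative, not the paper's code).

def gaussian_logprob(x, mean, std):
    """Log-density of N(x; mean, std)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def vlpo_update(mean, sampled_latent, advantage, std=1.0, lr=0.1):
    """REINFORCE-style step on the latent distribution's mean.

    d/d(mean) log N(x; mean, std) = (x - mean) / std**2
    """
    grad = (sampled_latent - mean) / std ** 2
    return mean + lr * advantage * grad

mean = 0.0
# A positively-rewarded rollout whose sampled latent sat above the mean
# pulls the mean toward that latent.
new_mean = vlpo_update(mean, sampled_latent=0.5, advantage=1.0)
```

The design point is where the gradient flows: by scoring the continuous latent under an explicit distribution, the reward signal reaches the latent-generation parameters instead of only the text head.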