Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

📅 2025-01-03
🤖 AI Summary
Multimodal large language models (MLLMs) often lack robust long-horizon, logically consistent visual reasoning capabilities. Method: This paper introduces Virgo, a lightweight supervision framework that fine-tunes strong MLLM backbones using only a small amount of pure-text chain-of-thought (CoT) data, with no architectural modifications or auxiliary inference modules. Crucially, it leverages the inherent "slow-thinking" capability of language models, showing that textual reasoning trajectories transfer effectively across modalities and unlock deep visual reasoning more effectively than vision-augmented CoT data. Contribution/Results: Experiments show that Virgo significantly improves logical consistency and accuracy on multi-step visual question answering, establishing for the first time that minimal text-only supervision suffices to endow MLLMs with stable, long-horizon visual reasoning abilities. The code and dataset are fully open-sourced.

📝 Abstract
Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
Problem

Research questions and friction points this paper is trying to address.

Multimodal Tasks
Large Language Models
Image-Text Integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal System
Textual Thinking Data
Cross-domain Transferability