OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) lack the self-verification and self-correction capabilities inherent in advanced large language models (LLMs), limiting their cross-modal reasoning reliability. Method: We propose the first iterative Supervised Fine-Tuning–Reinforcement Learning (SFT-RL) co-optimization framework for LVLMs. It begins by distilling reasoning chains from a pure-text R1 model to construct high-quality visual reasoning data; then alternates between RL-based policy optimization and SFT updates, forming a closed loop of “reasoning generation → quality assessment → data augmentation → model refinement.” Contribution/Results: Our method achieves significant performance gains on multimodal mathematical reasoning benchmarks—including MathVista, MathVerse, and MathVision. We publicly release the trained models, code, and a high-quality visual reasoning dataset, establishing a new paradigm for interpretable and verifiable reasoning in LVLMs.

📝 Abstract
Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved through RL with verifiable rewards, significantly improving model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL) to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps from high-quality captions of images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.
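The iterative self-improvement loop the abstract describes (distill reasoning from a text-only R1 model, then alternate SFT and RL, with each RL-improved model producing the next round's SFT data) can be sketched at a high level as below. This is a minimal illustrative sketch, not the authors' released code: every function name and data structure here (`distill_reasoning`, `sft`, `rl`, `generate_refined_data`, the dict-based "model") is a hypothetical placeholder standing in for the corresponding stage of the pipeline.

```python
# Hypothetical sketch of the iterative SFT -> RL loop from the paper.
# All names below are illustrative placeholders, not the authors' API.

def distill_reasoning(captions):
    """Stand-in for prompting a pure-text R1 model with image captions
    to produce reasoning chains (the initial SFT dataset)."""
    return [{"caption": c, "chain": f"reasoning about {c}"} for c in captions]

def sft(model, dataset):
    """Stand-in for supervised fine-tuning on reasoning-chain data."""
    return {"weights": model["weights"] + 1, "data_seen": len(dataset)}

def rl(model):
    """Stand-in for RL with verifiable rewards (e.g. answer correctness)."""
    return {**model, "weights": model["weights"] + 1}

def generate_refined_data(model, captions):
    """Stand-in for the RL-improved model generating and filtering the
    refined SFT dataset used in the next iteration."""
    return [{"caption": c, "chain": f"v{model['weights']} chain"} for c in captions]

def iterate(captions, rounds=3):
    dataset = distill_reasoning(captions)   # step 1: distill from text R1
    model = {"weights": 0}                  # toy stand-in for LVLM parameters
    for _ in range(rounds):
        model = sft(model, dataset)         # step 2: SFT on current data
        model = rl(model)                   # step 3: RL refinement
        dataset = generate_refined_data(model, captions)  # step 4: new SFT data
    return model, dataset
```

The point of the sketch is the data flow: the RL-improved model of round *t* is the data generator for round *t*+1, which is what closes the "reasoning generation → quality assessment → data augmentation → model refinement" loop.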
Problem

Research questions and friction points this paper is trying to address.

Integrating complex reasoning into vision-language models
Improving multimodal reasoning via iterative self-improvement
Enhancing performance on benchmarks like MathVista
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative SFT and RL for LVLM improvement
Distilling reasoning from text models to LVLMs
Self-improving LVLM via refined SFT datasets