WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited domain-specific reasoning capabilities—particularly in mathematics and visual perception—of current vision-language models (VLMs), aiming to establish an end-to-end trainable paradigm for general multimodal reasoning. Methodologically, we propose a scalable automated pipeline for multimodal question-answer synthesis and release WeThink, the first open-source dataset comprising over 120K samples with fine-grained, step-by-step reasoning path annotations. We further design a hybrid reward mechanism integrating symbolic rule verification and large-model-based evaluation to guide reinforcement learning optimization. Experiments demonstrate substantial improvements across 14 diverse MLLM benchmarks spanning mathematical reasoning, image-text understanding, and cross-modal commonsense reasoning, validating the framework’s generalizability and scalability. Key contributions include: (1) the first general-purpose vision-language reasoning dataset; (2) an automated synthesis and hybrid-reward-driven RL training paradigm; and (3) a systematic evaluation framework for cross-domain reasoning capability.

Technology Category

Application Category

📝 Abstract
Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLM), focusing on domain-specific tasks like math and visual perception, a critical question remains: How can we achieve the general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) A novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images. (2) The open-source WeThink dataset containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains. (3) A comprehensive exploration of RL on our dataset, incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across various task domains. Across 14 diverse MLLM benchmarks, we demonstrate that our WeThink dataset significantly enhances performance, from mathematical reasoning to diverse general multimodal tasks. Moreover, we show that our automated data pipeline can continuously increase data diversity to further improve model performance.
Problem

Research questions and friction points this paper is trying to address.

Achieving general-purpose vision-language reasoning via reinforcement learning
Generating scalable multimodal QA pairs for diverse reasoning tasks
Enhancing MLLM performance across varied benchmarks with hybrid rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable Multimodal QA Synthesis pipeline
Open-source WeThink dataset with reasoning paths
Hybrid reward mechanism for RL training
🔎 Similar Papers
No similar papers found.
J
Jie Yang
WeChat Vision, Tencent
Feipeng Ma
Feipeng Ma
University of Science and Technology of China
Computer Vision
Z
Zitian Wang
WeChat Vision, Tencent
Dacheng Yin
Dacheng Yin
University of Science and Technology of China
speech enhancementrepresentation learningspeech editing
K
Kang Rong
WeChat Vision, Tencent
F
Fengyun Rao
WeChat Vision, Tencent
R
Ruimao Zhang
Sun Yat-sen University