LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Large multimodal models (LMMs) with 3B parameters suffer from weak reasoning capabilities and poor cross-modal alignment. Meanwhile, rule-based reinforcement learning (RL) faces two key bottlenecks in multimodal settings: scarcity of high-quality multimodal data and degradation of foundational reasoning abilities. Method: The paper proposes LMM-R1, a two-stage rule-based RL framework: (1) Foundational Reasoning Enhancement (FRE), which strengthens core reasoning on text-only data, and (2) Multimodal Generalization Training (MGT), which generalizes the enhanced reasoning to multimodal tasks. This is the first work to decouple rule-based RL into text-grounded foundation building and multimodal generalization. Contribution/Results: Evaluated on Qwen2.5-VL-Instruct-3B, the method achieves average improvements of 4.83% on multimodal benchmarks and 4.5% on text-only benchmarks over baselines, plus a 3.63% gain on complex Football Game tasks. These results indicate that the approach copes with ambiguous answers and scarce complex-reasoning samples while reducing reliance on costly high-quality multimodal training data.

📝 Abstract
Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
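The abstract does not spell out the reward rules, so the following is only a minimal sketch of what a rule-based (verifiable) reward commonly looks like, not the authors' implementation. The `<think>`/`<answer>` tag convention, the function names, the normalization step, and the 0.5 format weight are all assumptions made for illustration.

```python
import re

# Assumed output convention (not stated in the source): reasoning inside
# <think>...</think>, followed by the final answer inside <answer>...</answer>.
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def normalize(answer: str) -> str:
    """Strip all whitespace and lowercase so trivially different answers match."""
    return re.sub(r"\s+", "", answer).lower()


def rule_based_reward(response: str, reference: str, format_weight: float = 0.5) -> float:
    """Score a response with fixed rules instead of a learned reward model.

    Hypothetical scheme: a response that violates the format earns 0.0;
    a well-formed response earns `format_weight`, plus the remaining weight
    if its extracted answer matches the reference exactly after normalization.
    """
    match = ANSWER_RE.fullmatch(response.strip())
    if match is None:
        return 0.0
    accuracy = 1.0 if normalize(match.group(1)) == normalize(reference) else 0.0
    return format_weight + (1.0 - format_weight) * accuracy
```

Exact-match checking of this kind is one reason ambiguous answers are a data bottleneck: a sample whose reference answer cannot be verified by a rule gives no usable reward signal.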
Problem

Research questions and friction points this paper is trying to address.

Enhance reasoning in 3B-parameter Large Multimodal Models (LMMs).
Overcome data limitations and degraded reasoning in multimodal RL.
Generalize text-based reasoning to multimodal domains efficiently.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage rule-based RL for multimodal reasoning
Foundational Reasoning Enhancement with text-only data
Multimodal Generalization Training for enhanced reasoning
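As a rough illustration of the two-stage split listed above, the sketch below routes samples to a stage by modality. The `Sample` type, the `image_path` field, and the `stage_data` helper are hypothetical names; the source only states that FRE uses text-only data and that MGT targets multimodal domains.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    question: str
    answer: str
    image_path: Optional[str] = None  # None marks a text-only sample (assumed convention)


def stage_data(samples: List[Sample], stage: str) -> List[Sample]:
    """Select the training samples for one stage of the two-stage schedule."""
    if stage == "FRE":  # Foundational Reasoning Enhancement: text-only data
        return [s for s in samples if s.image_path is None]
    if stage == "MGT":  # Multimodal Generalization Training: image-grounded data
        return [s for s in samples if s.image_path is not None]
    raise ValueError(f"unknown stage: {stage!r}")


pool = [
    Sample("What is 12 * 7?", "84"),
    Sample("How many sides does the shape have?", "6", image_path="hexagon.png"),
]
fre_set = stage_data(pool, "FRE")  # used first
mgt_set = stage_data(pool, "MGT")  # used second, starting from the FRE-trained model
```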
Yingzhe Peng
Southeast University
LLM, NLP, Multimodal
Gongrui Zhang
Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Miaosen Zhang
Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Zhiyuan You
MMLab, The Chinese University of Hong Kong
Deep Learning, Computer Vision, Low-level Vision
Jie Liu
The Chinese University of Hong Kong
Qipeng Zhu
Fudan University
Kai Yang
Ant Group
Xingzhong Xu
Ant Group
Xin Geng
School of Computer Science and Engineering, Southeast University
Artificial Intelligence, Pattern Recognition, Machine Learning
Xu Yang
Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China