LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

📅 2025-03-10

🤖 AI Summary

Problem: Large multimodal models (LMMs) with 3B parameters suffer from weak reasoning capabilities and poor cross-modal alignment. Meanwhile, rule-based reinforcement learning (RL) faces two key bottlenecks in multimodal settings: scarcity of high-quality multimodal data and degradation of foundational reasoning abilities. Method: We propose a two-stage rule-based RL framework: (1) Foundational Reasoning Enhancement (FRE), which strengthens core reasoning on text-only data, and (2) Multimodal Generalization Training (MGT), which generalizes the enhanced reasoning to multimodal tasks. This is the first work to decouple rule-based RL into text-grounded foundation building and multimodal generalization. Contribution/Results: Evaluated on Qwen2.5-VL-Instruct-3B, our method achieves average gains over baselines of 4.83% on multimodal benchmarks and 4.5% on text-only benchmarks, plus a 3.63% gain on the complex Football Game tasks. These results indicate that the approach copes with ambiguous answers and scarce complex-reasoning samples while significantly reducing reliance on high-quality multimodal annotations.

📝 Abstract

Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
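
The abstract does not spell out what the "rules" in rule-based RL look like. As a rough illustration only, the sketch below shows one common shape such a reward can take: a small bonus for following an answer format plus a larger bonus for matching a verifiable reference answer. The `\boxed{...}` answer convention, the reward weights, and the names `normalize` and `rule_based_reward` are all assumptions made for this example, not details taken from the paper.

```python
import re

# Assumption: the model is asked to wrap its final answer in \boxed{...}.
# The paper's actual answer format is not given in this summary.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

# Hypothetical weights, chosen only for illustration.
FORMAT_REWARD = 0.1
ACCURACY_REWARD = 1.0


def normalize(answer: str) -> str:
    """Lowercase the answer and drop spaces so trivial variants compare equal."""
    return answer.strip().lower().replace(" ", "")


def rule_based_reward(response: str, reference: str) -> float:
    """Score a response with fixed rules instead of a learned reward model."""
    matches = BOXED.findall(response)
    if not matches:
        # No parsable final answer: no format bonus and no accuracy bonus.
        return 0.0
    # Use the last boxed expression as the final answer.
    is_correct = normalize(matches[-1]) == normalize(reference)
    return FORMAT_REWARD + (ACCURACY_REWARD if is_correct else 0.0)
```

A reward of this kind needs an unambiguous reference answer, which is one way to read the abstract's point that ambiguous answers limit the usable multimodal data.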

Problem

Research questions and friction points this paper is trying to address.

Enhance reasoning in 3B-parameter Large Multimodal Models (LMMs).

Overcome data limitations and degraded reasoning in multimodal RL.

Generalize text-based reasoning to multimodal domains efficiently.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage rule-based RL for multimodal reasoning

Foundational Reasoning Enhancement with text-only data

Multimodal Generalization Training for enhanced reasoning
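
The ordering of the two stages can be sketched as a simple sequential pipeline in which the checkpoint produced by FRE becomes the starting point for MGT. This is a minimal sketch under stated assumptions: `Stage`, `PIPELINE`, `run_pipeline`, and `fake_train_stage` are hypothetical names, and the stand-in trainer only records which stages ran rather than performing any RL update.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    name: str  # stage label from the paper: "FRE" or "MGT"
    data: str  # kind of training data the stage uses


# Stage order as described in the abstract: text-only first, multimodal second.
PIPELINE = (
    Stage(name="FRE", data="text-only"),
    Stage(name="MGT", data="multimodal"),
)


def run_pipeline(
    checkpoint: str,
    train_stage: Callable[[str, Stage], str],
    stages: Sequence[Stage] = PIPELINE,
) -> str:
    """Run each stage in order, feeding each stage's output into the next."""
    for stage in stages:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint


def fake_train_stage(checkpoint: str, stage: Stage) -> str:
    """Placeholder for a rule-based RL run; it only tags the checkpoint name."""
    return f"{checkpoint}+{stage.name}"
```

Calling `run_pipeline("base", fake_train_stage)` with the stand-in trainer returns a tag showing that MGT starts from the FRE output rather than from the base model.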
