MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

📅 2025-09-25
🤖 AI Summary
Large multimodal reasoning models suffer from scarce high-quality chain-of-thought (CoT) data and unstable reinforcement learning (RL) training. To address these challenges, we propose Variance-Aware Sampling (VAS), a sampling strategy that jointly models outcome variance and reasoning-path diversity to enhance reward signal quality, and we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude. Building on the Group Relative Policy Optimization (GRPO) framework, we design a Variance Promotion Score (VPS) to guide data selection during RL fine-tuning. Furthermore, we construct and publicly release a large-scale cold-start CoT dataset and an RL fine-tuning dataset for multimodal reasoning, establishing a reproducible multimodal reasoning baseline. Extensive experiments on multiple mathematical reasoning benchmarks demonstrate consistent improvements over state-of-the-art methods, validating both the effectiveness and generalizability of VAS and the curated high-quality data resources.

📝 Abstract
Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
Problem

Research questions and friction points this paper is trying to address.

Addresses the absence of open, large-scale, high-quality chain-of-thought data
Mitigates reinforcement-learning instability caused by low reward variance during training
Overcomes gradient vanishing in policy optimization that impairs model convergence
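The gradient-vanishing problem listed above can be seen directly in GRPO-style group-normalized advantages. A minimal sketch (the `eps` stabilizer and function names are illustrative, not from the paper): when every rollout in a group earns the same reward, all advantages collapse to zero and the policy gradient carries no signal.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style: each rollout's
    reward is normalized by the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group with reward variance yields informative advantages ...
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))

# ... but a zero-variance group (all rollouts correct, or all wrong)
# gives all-zero advantages, so the policy gradient vanishes.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0.0, 0.0, 0.0, 0.0]
```

This is why sampling questions whose rollouts have high reward variance, as VAS does, keeps the optimization signal alive.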
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variance-Aware Sampling strategy to stabilize policy optimization
Large-scale curated resources: ~1.6M long chain-of-thought cold-start examples and ~15k RL QA pairs
Open-source multimodal reasoning models establishing standardized baselines
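To make the sampling idea concrete: the page states that VPS combines outcome variance and reasoning-path diversity, but gives no formula, so the weighting `alpha` and all names below are assumptions. For binary rewards, outcome variance is `p * (1 - p)`, which peaks at a 50% pass rate, so VPS naturally favors mid-difficulty, diverse questions.

```python
def variance_promotion_score(pass_rate, diversity, alpha=0.5):
    """Hypothetical VPS sketch: a convex combination of outcome
    variance (p * (1 - p) for binary rewards) and a diversity score
    in [0, 1]. The exact formulation in the paper may differ."""
    outcome_var = pass_rate * (1.0 - pass_rate)
    return alpha * outcome_var + (1.0 - alpha) * diversity

# Toy question pool: (pass_rate, reasoning-path diversity)
pool = {
    "q_easy": (0.95, 0.2),  # nearly always solved -> low outcome variance
    "q_mid":  (0.50, 0.6),  # maximal outcome variance
    "q_hard": (0.05, 0.4),  # nearly never solved -> low outcome variance
}
ranked = sorted(pool, key=lambda q: variance_promotion_score(*pool[q]),
                reverse=True)
print(ranked)  # -> ['q_mid', 'q_hard', 'q_easy']
```

Under this sketch, sampling questions in VPS order concentrates RL training on prompts whose rollouts produce non-degenerate reward distributions.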
Sicong Leng
Nanyang Technological University
Multi-modal Learning
Jing Wang
Nanyang Technological University
Jiaxi Li
Singapore University of Technology and Design
Hao Zhang
DAMO Academy, Alibaba Group
Zhiqiang Hu
DAMO Academy, Alibaba Group
Boqiang Zhang
Tencent AILab
Yuming Jiang
DAMO Academy, Alibaba Group
Hang Zhang
DAMO Academy, Alibaba Group
Xin Li
DAMO Academy, Alibaba Group
Lidong Bing
MiroMind, Alibaba DAMO, Tencent, CMU, CUHK
Natural Language Processing, Large Language Models, Large Multimodal Models
Deli Zhao
Alibaba DAMO Academy
generative models, multimodal learning, foundation models
Wei Lu
Nanyang Technological University
Yu Rong
DAMO Academy, Alibaba Group
Aixin Sun
Nanyang Technological University
Shijian Lu
College of Computing and Data Science, NTU
Image and video analytics, computer vision, machine learning