MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

📅 2025-09-25
🤖 AI Summary
Large multimodal reasoning models suffer from scarce high-quality chain-of-thought (CoT) data and unstable reinforcement learning (RL) training. To address these challenges, we propose Variance-Aware Sampling (VAS), a sampling strategy that jointly models outcome variance and reasoning-path diversity to enhance reward signal quality, and we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude. Building on the Group Relative Policy Optimization (GRPO) framework, we design a Variance Promotion Score (VPS) to guide data selection during RL fine-tuning. Furthermore, we construct and publicly release a large-scale cold-start CoT dataset and an RL fine-tuning dataset for multimodal reasoning, establishing a reproducible multimodal reasoning baseline. Extensive experiments on multiple mathematical reasoning benchmarks demonstrate consistent improvements over state-of-the-art methods, validating both the effectiveness and generalizability of VAS and the curated high-quality data resources.

📝 Abstract
Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
Problem

Research questions and friction points this paper is trying to address.

Addresses the absence of open, large-scale, high-quality chain-of-thought data
Mitigates reinforcement-learning instability caused by low reward variance during training
Overcomes gradient vanishing in policy optimization that impairs model convergence
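The gradient-vanishing problem listed above can be seen directly in GRPO-style group-normalized advantages. A minimal sketch (the `eps` stabilizer and function names are illustrative, not from the paper): when every rollout in a group earns the same reward, all advantages collapse to zero and the policy gradient carries no signal.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style: each rollout's
    reward is normalized by the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group with reward variance yields informative advantages ...
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))

# ... but a zero-variance group (all rollouts correct, or all wrong)
# gives all-zero advantages, so the policy gradient vanishes.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0.0, 0.0, 0.0, 0.0]
```

This is why sampling questions whose rollouts have high reward variance, as VAS does, keeps the optimization signal alive.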
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variance-Aware Sampling strategy to stabilize policy optimization
Large-scale curated resources: ~1.6M long chain-of-thought cold-start examples and ~15k RL QA pairs
Open-source multimodal reasoning models establishing standardized baselines
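To make the sampling idea concrete: the page states that VPS combines outcome variance and reasoning-path diversity, but gives no formula, so the weighting `alpha` and all names below are assumptions. For binary rewards, outcome variance is `p * (1 - p)`, which peaks at a 50% pass rate, so VPS naturally favors mid-difficulty, diverse questions.

```python
def variance_promotion_score(pass_rate, diversity, alpha=0.5):
    """Hypothetical VPS sketch: a convex combination of outcome
    variance (p * (1 - p) for binary rewards) and a diversity score
    in [0, 1]. The exact formulation in the paper may differ."""
    outcome_var = pass_rate * (1.0 - pass_rate)
    return alpha * outcome_var + (1.0 - alpha) * diversity

# Toy question pool: (pass_rate, reasoning-path diversity)
pool = {
    "q_easy": (0.95, 0.2),  # nearly always solved -> low outcome variance
    "q_mid":  (0.50, 0.6),  # maximal outcome variance
    "q_hard": (0.05, 0.4),  # nearly never solved -> low outcome variance
}
ranked = sorted(pool, key=lambda q: variance_promotion_score(*pool[q]),
                reverse=True)
print(ranked)  # -> ['q_mid', 'q_hard', 'q_easy']
```

Under this sketch, sampling questions in VPS order concentrates RL training on prompts whose rollouts produce non-degenerate reward distributions.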
Sicong Leng
Nanyang Technological University
Multi-modal Learning
Jing Wang
Nanyang Technological University
Jiaxi Li
Singapore University of Technology and Design
Hao Zhang
DAMO Academy, Alibaba Group
Zhiqiang Hu
DAMO Academy, Alibaba Group
Boqiang Zhang
Tencent AILab
Yuming Jiang
DAMO Academy, Alibaba Group
Hang Zhang
DAMO Academy, Alibaba Group
Xin Li
DAMO Academy, Alibaba Group
Lidong Bing
MiroMind, Alibaba DAMO, Tencent, CMU, CUHK
Natural Language Processing, Large Language Models, Large Multimodal Models
Deli Zhao
Alibaba DAMO Academy
generative models, multimodal learning, foundation models
Wei Lu
Nanyang Technological University
Yu Rong
DAMO Academy, Alibaba Group
Aixin Sun
Nanyang Technological University
Shijian Lu
College of Computing and Data Science, NTU
Image and video analytics, computer vision, machine learning