SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To reduce reliance on costly human annotations of explicit reasoning processes in multimodal image segmentation, this paper proposes SAM-R1, a reinforcement learning framework that trains a multimodal large language model (MLLM) to perform fine-grained reasoning for segmentation without reasoning-annotated data. The method combines task-specific, fine-grained rewards with a tailored optimization objective to improve the alignment between reasoning and segmentation. Its key contributions are twofold: (1) it is the first approach to incorporate fine-grained segmentation settings into the training of multimodal reasoning models; and (2) it uses the Segment Anything Model (SAM) as a strong and flexible reward provider that guides the learning process. With only 3K training samples, the method achieves strong performance across multiple benchmarks.

📝 Abstract
Leveraging multimodal large models for image segmentation has become a prominent research direction. However, existing approaches typically rely heavily on manually annotated datasets that include explicit reasoning processes, which are costly and time-consuming to produce. Recent advances suggest that reinforcement learning (RL) can endow large models with reasoning capabilities without requiring such reasoning-annotated data. In this paper, we propose SAM-R1, a novel framework that enables multimodal large models to perform fine-grained reasoning in image understanding tasks. Our approach is the first to incorporate fine-grained segmentation settings during the training of multimodal reasoning models. By integrating task-specific, fine-grained rewards with a tailored optimization objective, we further enhance the model's reasoning and segmentation alignment. We also leverage the Segment Anything Model (SAM) as a strong and flexible reward provider to guide the learning process. With only 3k training samples, SAM-R1 achieves strong performance across multiple benchmarks, demonstrating the effectiveness of reinforcement learning in equipping multimodal models with segmentation-oriented reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on costly manual reasoning-annotated datasets
Enabling fine-grained reasoning in multimodal segmentation tasks
Integrating SAM for efficient reward-guided reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning for multimodal segmentation
Integrates task-specific, fine-grained rewards with a tailored optimization objective
Leverages SAM as flexible reward provider
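
The page does not say how SAM's output is turned into a reward. As a minimal sketch of one plausible signal, assume the reward is the intersection-over-union (IoU) between the mask a segmenter produces from the model's prompt and a ground-truth mask. Here `toy_segmenter` is a hypothetical stand-in for SAM (it is not SAM's API), and the box-prompt format and the function names are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Mask = Sequence[Sequence[int]]   # binary mask stored as rows of 0/1
Box = Tuple[int, int, int, int]  # assumed prompt format: (x0, y0, x1, y1)


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so define IoU as 1.0 in that case.
    return inter / union if union else 1.0


def segmentation_reward(prompt: Box, target: Mask,
                        segmenter: Callable[[Box], Mask]) -> float:
    """Reward = IoU between the segmenter's mask for `prompt` and `target`."""
    return mask_iou(segmenter(prompt), target)


def toy_segmenter(box: Box, size: int = 4) -> List[List[int]]:
    """Hypothetical stand-in for SAM: fills a size x size mask inside the box."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(size)]
            for y in range(size)]


# Ground truth: a 2x2 object in the top-left corner of a 4x4 image.
target_mask = toy_segmenter((0, 0, 2, 2))
exact = segmentation_reward((0, 0, 2, 2), target_mask, toy_segmenter)
loose = segmentation_reward((0, 0, 2, 4), target_mask, toy_segmenter)
missed = segmentation_reward((2, 2, 4, 4), target_mask, toy_segmenter)
```

In this toy setup a prompt that covers the object exactly scores higher than one that over-covers it, and a prompt that misses the object scores zero, which is the kind of graded feedback a reinforcement learning loop could optimize against.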
Jiaqi Huang, University of Central Missouri (Cybersecurity, IoV)
Dijkstra Xu
Jun Zhou, Tsinghua University
Ting Liu, Tsinghua University
Yicheng Xiao, Tsinghua University (Artificial Intelligence, Multimodal Learning)
Mingwen Ou, Tsinghua University
Bowen Ji, Tsinghua University
Xiu Li, Bytedance Seed (Computer Vision, Computer Graphics, 3D Vision)
Kehong Yuan, Tsinghua University