One RL to See Them All: Visual Triple Unified Reinforcement Learning

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reinforcement learning (RL) applications for vision-language models (VLMs) focus predominantly on reasoning tasks, leaving perception-intensive tasks, such as object detection, visual grounding, and OCR, poorly modeled. This work introduces V-Triune, the first unified RL framework that jointly enhances VLM capabilities across both visual reasoning (e.g., mathematical and chart understanding) and perception tasks. Its core contributions are: (1) a triple-component RL architecture featuring sample-level data formatting, verifier-level reward computation, and source-level metric monitoring; (2) a Dynamic IoU reward that provides adaptive, progressive feedback for perception outputs; and (3) an instantiation on an off-the-shelf RL training framework with open-source 7B and 32B VLM backbones, combining specialized verifiers with fine-grained data-source diagnostics. The resulting model, Orsta, achieves gains of +2.1 to +14.1 points on MEGA-Bench Core across its model variants, improves performance across eight diverse visual tasks, and generalizes well to downstream tasks.

📝 Abstract
Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
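The three components described in the abstract can be pictured as a small routing pipeline: each training sample carries a task tag that selects a task-specific verifier, and the resulting reward is logged per data source for monitoring. The sketch below illustrates that flow only; the verifier functions, field names, and reward formulas are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical verifiers: each maps (model output, ground truth) -> reward.
def math_verifier(output, answer):
    # Exact-match reward for a final numeric answer.
    return 1.0 if output.strip() == answer.strip() else 0.0

def ocr_verifier(output, answer):
    # Character-overlap ratio as a stand-in for an edit-distance reward.
    matches = sum(a == b for a, b in zip(output, answer))
    return matches / max(len(answer), 1)

VERIFIERS = {"math": math_verifier, "ocr": ocr_verifier}

# Source-level monitoring: accumulate rewards per data source.
source_stats = defaultdict(list)

def compute_reward(sample, model_output):
    """Route a sample to its task's verifier and log the reward by source."""
    verifier = VERIFIERS[sample["task"]]
    reward = verifier(model_output, sample["answer"])
    source_stats[sample["source"]].append(reward)
    return reward
```

Keeping per-source reward streams separate, as in `source_stats` above, is what allows a single training run to surface which datasets are stalling or regressing.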
Problem

Research questions and friction points this paper is trying to address.

Unifying visual reasoning and perception tasks in VLMs
Developing adaptive rewards for perception-intensive tasks
Enhancing performance across diverse vision-language benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified RL for vision-language reasoning and perception
Triple components: data, verifier, source monitoring
Dynamic IoU reward for adaptive perception feedback
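A Dynamic IoU reward of the kind the paper describes can be sketched as a threshold on box overlap that tightens as training progresses, so early outputs earn reward at loose overlap while later ones must localize precisely. The linear schedule and the 0.5/0.95 thresholds below are illustrative assumptions, not the paper's actual schedule.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, step, total_steps,
                       start_thresh=0.5, end_thresh=0.95):
    """Binary reward whose IoU threshold tightens linearly over training
    (illustrative schedule; the paper's exact schedule may differ)."""
    progress = step / total_steps
    thresh = start_thresh + (end_thresh - start_thresh) * progress
    return 1.0 if iou(pred_box, gt_box) >= thresh else 0.0
```

The progressive threshold gives the policy a learnable signal early on (coarse boxes still score) while converging toward a strict localization criterion, which is the "adaptive, progressive" behavior the abstract attributes to the reward.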
Authors
- Yan Ma
- Linge Du
- Xuyang Shen (MiniMax | ANU; Multimodal Machine Learning)
- Shaoxiang Chen
- Pengfei Li
- Qibing Ren (Shanghai Jiao Tong University; machine learning, computer vision, trustworthy AI)
- Lizhuang Ma
- Yuchao Dai
- Pengfei Liu
- Junjie Yan