PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

📅 2026-04-30
🤖 AI Summary
This work addresses two problems in post-training multimodal large language models: supervised fine-tuning (SFT) introduces distribution shift that degrades the model's original capabilities, and perception and reasoning errors become entangled, which undermines subsequent reinforcement learning (RL). To mitigate these issues, the authors propose PRISM, a three-stage training framework that inserts an explicit distribution-alignment stage between SFT and reinforcement learning with verifiable rewards (RLVR). This stage formulates alignment as a response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals via black-box on-policy distillation (OPD) without requiring teacher logits. Evaluated on Qwen3-VL, with 113K high-fidelity multimodal demonstrations curated from Gemini 3 Flash for the alignment stage, PRISM improves downstream RLVR average accuracy by 4.4 and 6.0 points over the SFT-to-RLVR baseline for the 4B and 8B models, respectively, and is compatible with multiple RL algorithms (GRPO, DAPO, GSPO).
📝 Abstract
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
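The response-level adversarial game described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the gating scheme, the expert architecture, or the loss, so the softmax gate, the sigmoid experts, the log-probability reward, and every function name below are assumptions chosen for illustration. The only property taken from the abstract is that the discriminator scores whole responses through a perception expert and a reasoning expert, so no teacher logits are needed.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_discriminator(perception_logit, reasoning_logit, gate_logits):
    """Hypothetical two-expert discriminator.

    Each expert emits a logit for "this response looks like the
    supervision data"; a softmax gate mixes the two probabilities.
    Returns the mixed probability and the per-expert probabilities.
    """
    weights = softmax(gate_logits)
    expert_probs = [sigmoid(perception_logit), sigmoid(reasoning_logit)]
    mixed = sum(w * p for w, p in zip(weights, expert_probs))
    return mixed, expert_probs


def policy_reward(p_supervision, eps=1e-8):
    # Assumed GAN-style reward: the policy is rewarded when the
    # discriminator believes its response came from the supervision data.
    return math.log(p_supervision + eps)


def discriminator_loss(p_teacher_response, p_policy_response, eps=1e-8):
    # Assumed binary cross-entropy: teacher responses are labeled 1,
    # on-policy responses are labeled 0.
    return -(math.log(p_teacher_response + eps)
             + math.log(1.0 - p_policy_response + eps))


# Usage: a response whose reasoning looks teacher-like but whose
# perception does not; the per-expert scores expose which part is off.
p, (p_perception, p_reasoning) = moe_discriminator(-2.0, 1.5, [0.0, 0.0])
reward = policy_reward(p)
```

In this sketch the per-expert probabilities are what make the signal "disentangled": a low perception score with a high reasoning score points at a grounding error rather than a faulty derivation. How PRISM actually routes and combines these signals is described in the paper and repository, not here.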
Problem

Research questions and friction points this paper is trying to address.

distributional drift
multimodal reinforcement learning
supervised fine-tuning
perception-reasoning disentanglement
post-training alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
distribution alignment
multimodal reinforcement learning
Mixture-of-Experts discriminator
black-box adversarial training