Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work proposes Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy, addressing the high training and inference costs of conventional parallel test-time scaling approaches that require separate models for generation and verification. ADPO combines a preference-based verification reward, which uses mean verification scores from positive and negative samples as decision thresholds, with an advantage-decoupling mechanism that computes separate advantages for generation and verification and applies token masks to isolate their gradients within a masked GRPO objective. Experiments demonstrate consistent improvements: accuracy gains of 2.8% and 1.4% on MathVista and MMMU, respectively; a 1.9-point increase in cIoU on ReasonSeg; and step success rate improvements of 1.7% and 1.0% on AndroidControl and GUI Odyssey. ADPO also reduces inference time by 53.5% and improves verification AUC by up to 34.1%.
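For illustration, here is a minimal NumPy sketch of the preference verification reward as described in the summary. The exact thresholding rule is not spelled out here, so treating the mean scores of the correct and incorrect samples as per-class decision thresholds is an assumption, and all names (`preference_verification_reward`, `verif_scores`) are hypothetical.

```python
import numpy as np

def preference_verification_reward(verif_scores, is_correct):
    """Hypothetical preference verification reward (a sketch, not the paper's code).

    verif_scores: self-verification scores in [0, 1], one per sampled answer.
    is_correct:   whether each sampled answer is actually correct.
    """
    verif_scores = np.asarray(verif_scores, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)

    # If the group lacks positive or negative samples, no threshold is defined.
    if is_correct.all() or (~is_correct).all():
        return np.zeros_like(verif_scores)

    pos_mean = verif_scores[is_correct].mean()   # mean score over correct answers
    neg_mean = verif_scores[~is_correct].mean()  # mean score over incorrect answers

    # Assumed rule: a correct answer should score above the negative-sample mean,
    # an incorrect answer below the positive-sample mean; reward 1 when predicted
    # correctness aligns with actual answer correctness, else 0.
    aligned = np.where(is_correct, verif_scores > neg_mean, verif_scores < pos_mean)
    return aligned.astype(float)
```

For example, with `verif_scores=[0.9, 0.2, 0.6]` and `is_correct=[True, False, False]`, the negative-sample mean is 0.4 and the positive-sample mean is 0.9, so all three samples are rewarded: the correct answer scores above 0.4 and both incorrect answers score below 0.9.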

📝 Abstract
Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward that improves verification capability, and a decoupled optimization mechanism that enables synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage-decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines the masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and 53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
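To make the decoupling concrete, below is a minimal PyTorch sketch of a masked, advantage-decoupled GRPO-style objective under stated assumptions: it uses group-normalized advantages and a plain log-probability surrogate, omitting GRPO's clipped importance ratios and KL penalty for brevity, and all tensor names are illustrative rather than taken from the paper.

```python
import torch

def masked_grpo_loss(logps, gen_mask, verif_mask, gen_rewards, verif_rewards, eps=1e-6):
    """Sketch of an advantage-decoupled masked GRPO objective (assumed form).

    logps:        (G, T) per-token log-probs of G sampled rollouts under the policy.
    gen_mask:     (G, T) 1 on answer-generation tokens, 0 elsewhere.
    verif_mask:   (G, T) 1 on self-verification tokens, 0 elsewhere.
    gen_rewards / verif_rewards: (G,) scalar rewards per rollout.
    """
    # Group-relative advantages, computed separately for the two reward streams.
    gen_adv = (gen_rewards - gen_rewards.mean()) / (gen_rewards.std() + eps)
    verif_adv = (verif_rewards - verif_rewards.mean()) / (verif_rewards.std() + eps)

    # Token masks isolate the gradients: the generation advantage touches only
    # generation tokens, the verification advantage only verification tokens.
    gen_term = (gen_adv[:, None] * logps * gen_mask).sum() / gen_mask.sum().clamp(min=1)
    verif_term = (verif_adv[:, None] * logps * verif_mask).sum() / verif_mask.sum().clamp(min=1)

    # Combined masked objective (to maximize), returned as a loss to minimize.
    return -(gen_term + verif_term)
```

Because each advantage multiplies only its own token span, gradients from verification calibration cannot flow into the generation tokens and vice versa, which is the intuition behind "preserving generation quality while calibrating verification scores."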
Problem

Research questions and friction points this paper is trying to address.

vision-language models
test-time scaling
generation and verification
inference cost
training cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage Decoupled Preference Optimization
self-verification
vision-language models
preference reward
decoupled optimization
👥 Authors
Xinyu Qiu (College of Computer Science and Technology, Zhejiang University)
Heng Jia (Zhejiang University)
Zhengwen Zeng (Venus Team, Ant Group)
Shuheng Shen (Ant Group)
Changhua Meng (Venus Team, Ant Group)
Yi Yang (Zhejiang University)
Linchao Zhu (College of Computer Science and Technology, Zhejiang University)