Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work proposes Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy, addressing the high training and inference costs of conventional parallel test-time scaling approaches that require separate models for generation and verification. ADPO combines a preference-based verification reward, which uses mean verification scores from positive and negative samples as decision thresholds, with an advantage-decoupling mechanism that computes separate advantages for generation and verification and applies token masks to isolate their gradients within a masked GRPO objective. Experiments demonstrate consistent improvements: accuracy gains of 2.8% and 1.4% on MathVista and MMMU, respectively; a 1.9-point increase in cIoU on ReasonSeg; and step success rate improvements of 1.7% and 1.0% on AndroidControl and GUI Odyssey. ADPO also reduces inference time by 53.5% and improves verification AUC by up to 34.1%.
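For illustration, here is a minimal NumPy sketch of the preference verification reward as described in the summary. The exact thresholding rule is not spelled out here, so treating the mean scores of the correct and incorrect samples as per-class decision thresholds is an assumption, and all names (`preference_verification_reward`, `verif_scores`) are hypothetical.

```python
import numpy as np

def preference_verification_reward(verif_scores, is_correct):
    """Hypothetical preference verification reward (a sketch, not the paper's code).

    verif_scores: self-verification scores in [0, 1], one per sampled answer.
    is_correct:   whether each sampled answer is actually correct.
    """
    verif_scores = np.asarray(verif_scores, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)

    # If the group lacks positive or negative samples, no threshold is defined.
    if is_correct.all() or (~is_correct).all():
        return np.zeros_like(verif_scores)

    pos_mean = verif_scores[is_correct].mean()   # mean score over correct answers
    neg_mean = verif_scores[~is_correct].mean()  # mean score over incorrect answers

    # Assumed rule: a correct answer should score above the negative-sample mean,
    # an incorrect answer below the positive-sample mean; reward 1 when predicted
    # correctness aligns with actual answer correctness, else 0.
    aligned = np.where(is_correct, verif_scores > neg_mean, verif_scores < pos_mean)
    return aligned.astype(float)
```

For example, with `verif_scores=[0.9, 0.2, 0.6]` and `is_correct=[True, False, False]`, the negative-sample mean is 0.4 and the positive-sample mean is 0.9, so all three samples are rewarded: the correct answer scores above 0.4 and both incorrect answers score below 0.9.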

📝 Abstract
Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward that improves verification capability, and a decoupled optimization mechanism that enables synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage-decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines the masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and 53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
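To make the decoupling concrete, below is a minimal PyTorch sketch of a masked, advantage-decoupled GRPO-style objective under stated assumptions: it uses group-normalized advantages and a plain log-probability surrogate, omitting GRPO's clipped importance ratios and KL penalty for brevity, and all tensor names are illustrative rather than taken from the paper.

```python
import torch

def masked_grpo_loss(logps, gen_mask, verif_mask, gen_rewards, verif_rewards, eps=1e-6):
    """Sketch of an advantage-decoupled masked GRPO objective (assumed form).

    logps:        (G, T) per-token log-probs of G sampled rollouts under the policy.
    gen_mask:     (G, T) 1 on answer-generation tokens, 0 elsewhere.
    verif_mask:   (G, T) 1 on self-verification tokens, 0 elsewhere.
    gen_rewards / verif_rewards: (G,) scalar rewards per rollout.
    """
    # Group-relative advantages, computed separately for the two reward streams.
    gen_adv = (gen_rewards - gen_rewards.mean()) / (gen_rewards.std() + eps)
    verif_adv = (verif_rewards - verif_rewards.mean()) / (verif_rewards.std() + eps)

    # Token masks isolate the gradients: the generation advantage touches only
    # generation tokens, the verification advantage only verification tokens.
    gen_term = (gen_adv[:, None] * logps * gen_mask).sum() / gen_mask.sum().clamp(min=1)
    verif_term = (verif_adv[:, None] * logps * verif_mask).sum() / verif_mask.sum().clamp(min=1)

    # Combined masked objective (to maximize), returned as a loss to minimize.
    return -(gen_term + verif_term)
```

Because each advantage multiplies only its own token span, gradients from verification calibration cannot flow into the generation tokens and vice versa, which is the intuition behind "preserving generation quality while calibrating verification scores."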
Problem

Research questions and friction points this paper is trying to address.

vision-language models
test-time scaling
generation and verification
inference cost
training cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage Decoupled Preference Optimization
self-verification
vision-language models
preference reward
decoupled optimization
👥 Authors
Xinyu Qiu (College of Computer Science and Technology, Zhejiang University)
Heng Jia (Zhejiang University)
Zhengwen Zeng (Venus Team, Ant Group)
Shuheng Shen (Ant Group)
Changhua Meng (Venus Team, Ant Group)
Yi Yang (Zhejiang University)
Linchao Zhu (College of Computer Science and Technology, Zhejiang University)