Visual Preference Optimization with Rubric Rewards

📅 2026-04-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing visual preference optimization methods: they rely on coarse-grained outcome feedback or off-policy perturbations, which are poorly suited to fine-grained visual reasoning. The authors propose rDPO, a framework built on instance-specific rubrics. For each image-instruction pair, a checklist-style rubric defines essential and additional criteria; the resulting instruction-rubric pool is built offline and reused when constructing on-policy preference data and scoring responses. Rubric-based prompting brings a 30B-A3B judge close to GPT-5.4 on public reward modeling benchmarks. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69 (from 81.14), whereas outcome-based filtering lowers it to 75.82. In the scalability evaluation on a comprehensive benchmark, rDPO reaches 61.01, ahead of both the style-constrained baseline (52.36) and the base model (59.48).

📝 Abstract
The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria that can score responses from any policy. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average from 81.14 to 82.69, whereas outcome-based filtering drops it to 75.82. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the base model (59.48). Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
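
The abstract does not spell out how rubric scores become preference pairs, so the following is only a minimal sketch of one way the idea could work. Everything in it is an assumption for illustration: the `Rubric` class, the weighting of essential criteria, the `min_margin` filter, and the toy `keyword_judge` (a stand-in for the LLM judge a real pipeline would use) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rubric:
    """Hypothetical checklist for one image-instruction pair."""
    essential: Tuple[str, ...]   # criteria a good response must satisfy
    additional: Tuple[str, ...]  # nice-to-have criteria


# A judge decides whether a response satisfies one criterion.
Judge = Callable[[str, str], bool]


def keyword_judge(response: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return criterion.lower() in response.lower()


def rubric_score(response: str, rubric: Rubric, judge: Judge,
                 essential_weight: float = 2.0) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1] (assumed formula)."""
    earned = sum(essential_weight for c in rubric.essential if judge(response, c))
    earned += sum(1.0 for c in rubric.additional if judge(response, c))
    total = essential_weight * len(rubric.essential) + len(rubric.additional)
    return earned / total if total else 0.0


def build_pair(responses: Sequence[str], rubric: Rubric, judge: Judge,
               min_margin: float = 0.25) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from sampled responses, or None if the
    best and worst responses are too close to give a useful signal."""
    if len(responses) < 2:
        return None
    scored = sorted(((rubric_score(r, rubric, judge), r) for r in responses),
                    key=lambda pair: pair[0])
    (low, rejected), (high, chosen) = scored[0], scored[-1]
    if high - low < min_margin:
        return None  # filtered out: no clear quality difference
    return chosen, rejected


# Usage: three sampled responses to a hypothetical "count the apples" prompt.
rubric = Rubric(essential=("three", "red"), additional=("left",))
samples = [
    "There are three red apples on the left.",
    "There are two green apples.",
    "three red apples",
]
pair = build_pair(samples, rubric, keyword_judge)
print(pair)
```

The resulting (chosen, rejected) tuples are in the shape a DPO trainer typically expects. Sampling the candidate responses from the policy being trained is what would make such data on-policy; the rubric itself is independent of which policy produced the responses.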
Problem

Research questions and friction points this paper is trying to address.

Visual Preference Optimization
Multimodal Tasks
Fine-grained Visual Reasoning
Preference Data
Rubric-based Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-based Preference Optimization
Visual Reasoning
On-policy Data Construction
Multimodal Alignment
Fine-grained Feedback
Authors

Ya-Qi Yu
Huawei Technologies
Computer Vision, Machine Learning

Fangyu Hong
Huawei Technologies Co., Ltd.

Xiangyang Qu
Huawei Technologies Co., Ltd.

Hao Wang
Ph.D. from Peking University (2017), then joined Huawei
Pervasive Computing, Device-free Wi-Fi Human Sensing, Mobile Sensing, Smart Environments

Gaojie Wu
Huawei Technologies Co., Ltd.

Qiaoyu Luo
Huawei Technologies Co., Ltd.

Nuo Xu
Huawei Technologies Co., Ltd.

Huixin Wang
Huawei Technologies Co., Ltd.

Wuheng Xu
Huawei Technologies Co., Ltd.

Yongxin Liao
Huawei Technologies Co., Ltd.

Zihao Chen
Huawei Technologies Co., Ltd.

Haonan Li
Huawei Technologies Co., Ltd.

Ziming Li
Huawei Technologies Co., Ltd.

Dezhi Peng
Huawei Technologies, South China University of Technology
Computer Vision

Minghui Liao
Huawei Technologies Co., Ltd.

Jihao Wu
Huawei Inc.
Computer Vision, Multi-Modality

Haoyu Ren
Huawei Technologies Co., Ltd.

Dandan Tu
Huawei Technologies Co., Ltd.