Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the longstanding trade-off between interpretability and efficiency in vision-language reward modeling: generative approaches offer transparency at high computational cost, while discriminative methods prioritize efficiency at the expense of explainability. To bridge this gap, we propose VL-MDR, a novel framework built on a dynamic multi-dimensional reward mechanism. VL-MDR employs a vision-aware gating module to adaptively select and weight fine-grained evaluation dimensions, such as hallucination and reasoning, producing efficient, dimension-level interpretable reward scores. We construct a large-scale preference dataset of 321k samples annotated across 21 fine-grained dimensions, and integrate multi-dimensional decomposition, dynamic gating, dimension-adaptive weighting, and DPO alignment into a unified pipeline. Experiments demonstrate that VL-MDR outperforms existing open-source reward models on benchmarks including VL-RewardBench, significantly mitigating visual hallucinations and enhancing model reliability and alignment.
📝 Abstract
Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.
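The dimension-adaptive weighting described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `gated_reward`, the linear gate, the softmax normalization, and all shapes are assumptions. The idea it demonstrates is that a gating vector over K evaluation dimensions (e.g., Hallucination, Reasoning) is computed from the fused vision-language representation of the input, then used to combine per-dimension scores into a single scalar while keeping the per-dimension weights inspectable.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of gate logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_reward(fused_embedding, dim_scores, gate_weights):
    """Hypothetical sketch of input-dependent dimension weighting.

    fused_embedding: (D,) fused vision-language features for one input
    dim_scores:      (K,) per-dimension reward scores (K = 21 in the paper)
    gate_weights:    (K, D) parameters of an assumed linear gating head
    Returns a scalar reward plus the per-dimension weights, so each
    score remains attributable to a named evaluation dimension.
    """
    logits = gate_weights @ fused_embedding   # (K,) input-dependent gate logits
    weights = softmax(logits)                 # adaptive weights, sum to 1
    return float(weights @ dim_scores), weights
```

Because the weights are an explicit distribution over named dimensions, the model can report which dimensions dominated a given judgment, which is the interpretability benefit the abstract claims over a monolithic scalar head.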
Problem

Research questions and friction points this paper is trying to address.

vision-language reward modeling
interpretability
efficiency
black-box models
reward modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic dimension selection
interpretable reward modeling
vision-language alignment
visual-aware gating
multi-dimensional reward
👥 Authors
Qiyuan Chen (Zhejiang University)
Hongsen Huang (Soochow Securities Co., Ltd.)
Jiahe Chen (Zhejiang University)
Qian Shao (Zhejiang University)
Jintai Chen (Assistant Professor @ HKUST(GZ); AI for Healthcare, Multimodal Learning, Deep Tabular Learning)
Hongxia Xu (Zhejiang University; AI4Science, Nanomedicine, Medical Imaging)
Renjie Hua (Nanjing University)
Chuan Ren (Soochow Securities Co., Ltd.)
Jian Wu (Zhejiang University)