Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models

📅 2026-02-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenges of training multimodal reasoning reward models, which are often hindered by high noise levels and low efficiency in preference data, limiting their ability to align with human preferences. The authors propose Entropy-Guided Training (EGT), a novel approach that reveals, for the first time, a strong correlation between response entropy and both annotation noise and sample difficulty. Leveraging this insight, EGT introduces an unsupervised data filtering mechanism and an entropy-based curriculum learning strategy, both requiring no additional supervision. Integrated within a multimodal large language model architecture, the method significantly outperforms state-of-the-art reward models across three benchmarks, achieving notable improvements in both accuracy and training efficiency.

๐Ÿ“ Abstract
Multimodal reward models are crucial for aligning multimodal large language models with human preferences. Recent works have incorporated reasoning capabilities into these models, achieving promising results. However, training these models suffers from two critical challenges: (1) the inherent noise in preference datasets, which degrades model performance, and (2) the inefficiency of conventional training methods, which ignore the differences in sample difficulty. In this paper, we identify a strong correlation between response entropy and accuracy, indicating that entropy can serve as a reliable and unsupervised proxy for annotation noise and sample difficulty. Based on this insight, we propose a novel Entropy-Guided Training (EGT) approach for multimodal reasoning reward models, which combines two strategies: (1) entropy-guided data curation to mitigate the impact of unreliable samples, and (2) an entropy-guided training strategy that progressively introduces more complex examples. Extensive experiments across three benchmarks show that the EGT-trained model consistently outperforms state-of-the-art multimodal reward models.
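The two strategies described in the abstract, entropy-guided data curation and an easy-to-hard training order, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entropy threshold, the sample field names, and the use of a binary preference distribution are all assumptions for the example.

```python
import math

def response_entropy(probs):
    """Shannon entropy (in nats) of a model's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_guided_curation(samples, noise_threshold=1.0):
    """Sketch of EGT-style curation: drop samples whose predictive entropy
    exceeds a threshold (treated as likely annotation noise), then order the
    remainder from low to high entropy (an easy-to-hard curriculum)."""
    scored = [(response_entropy(s["probs"]), s) for s in samples]
    kept = [(h, s) for h, s in scored if h <= noise_threshold]
    kept.sort(key=lambda pair: pair[0])  # low entropy (easy) first
    return [s for _, s in kept]

# Toy preference samples: `probs` is a model's distribution over
# {chosen preferred, rejected preferred} (hypothetical field names).
samples = [
    {"id": "easy",  "probs": [0.95, 0.05]},  # confident, likely clean label
    {"id": "hard",  "probs": [0.70, 0.30]},  # higher entropy, harder sample
    {"id": "noisy", "probs": [0.50, 0.50]},  # max entropy, likely noisy label
]
curriculum = entropy_guided_curation(samples, noise_threshold=0.65)
print([s["id"] for s in curriculum])  # → ['easy', 'hard']
```

With the illustrative threshold of 0.65 nats, the maximally uncertain sample is filtered out as probable noise, and the remaining samples are presented in order of increasing difficulty.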
Problem

Research questions and friction points this paper is trying to address.

multimodal reward models
preference dataset noise
training inefficiency
sample difficulty
data-efficient training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy-Guided Training
Multimodal Reward Models
Data-Efficient Learning
Response Entropy
Curriculum Learning
Shidong Yang (School of Software, Tsinghua University)
Tongwen Huang (AMAP, Alibaba Group)
Hao Wen (AMAP, Alibaba Group)
Yong Wang (Academy of Mathematics and Systems Science, Chinese Academy of Sciences; research interests: Optimization, Bioinformatics, Systems Biology, Complex Networks, Computational Biology)
Li Chen (School of Software, Tsinghua University)
Xiangxiang Chu (AMAP, Alibaba Group)