AI Summary
This work addresses the challenges of training multimodal reasoning reward models, which are often hindered by high noise levels and low efficiency in preference data, limiting their ability to align with human preferences. The authors propose Entropy-Guided Training (EGT), a novel approach that reveals, for the first time, a strong correlation between response entropy and both annotation noise and sample difficulty. Leveraging this insight, EGT introduces an unsupervised data filtering mechanism and an entropy-based curriculum learning strategy, neither of which requires additional supervision. Integrated within a multimodal large language model architecture, the method significantly outperforms state-of-the-art reward models across three benchmarks, achieving notable improvements in both accuracy and training efficiency.
Abstract
Multimodal reward models are crucial for aligning multimodal large language models with human preferences. Recent works have incorporated reasoning capabilities into these models, achieving promising results. However, training these models suffers from two critical challenges: (1) the inherent noise in preference datasets, which degrades model performance, and (2) the inefficiency of conventional training methods, which ignore the differences in sample difficulty. In this paper, we identify a strong correlation between response entropy and accuracy, indicating that entropy can serve as a reliable and unsupervised proxy for annotation noise and sample difficulty. Based on this insight, we propose a novel Entropy-Guided Training (EGT) approach for multimodal reasoning reward models, which combines two strategies: (1) entropy-guided data curation to mitigate the impact of unreliable samples, and (2) an entropy-guided training strategy that progressively introduces more complex examples. Extensive experiments across three benchmarks show that the EGT-trained model consistently outperforms state-of-the-art multimodal reward models.
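The two strategies described above, filtering unreliable samples and ordering the rest from easy to hard, can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it assumes response entropy is approximated by the mean negative log-probability of the sampled tokens, and the function names, sample fields, and the noise threshold are all invented for the example.

```python
def response_entropy(token_logprobs):
    # Cheap entropy proxy: average negative log-probability of the
    # sampled response tokens (higher = less confident response).
    return -sum(token_logprobs) / len(token_logprobs)

def entropy_guided_curation(samples, noise_threshold=2.0):
    # Step 1 (data curation): drop samples whose entropy exceeds the
    # threshold, treating them as likely annotation noise.
    # Step 2 (curriculum): sort the remaining samples from low to high
    # entropy, so training sees easy examples before hard ones.
    scored = [(response_entropy(s["token_logprobs"]), s) for s in samples]
    kept = [(h, s) for h, s in scored if h <= noise_threshold]
    kept.sort(key=lambda pair: pair[0])
    return [s for _, s in kept]

# Toy usage with made-up preference samples.
samples = [
    {"id": "a", "token_logprobs": [-0.1, -0.2, -0.1]},  # low entropy: easy
    {"id": "b", "token_logprobs": [-1.5, -1.2, -1.8]},  # mid entropy: harder
    {"id": "c", "token_logprobs": [-3.0, -2.5, -2.9]},  # high entropy: filtered
]
curriculum = entropy_guided_curation(samples)
print([s["id"] for s in curriculum])  # → ['a', 'b']
```

In a real pipeline the threshold would be chosen from the empirical entropy distribution rather than fixed, and the curriculum would be applied per training stage rather than as a single global sort.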