AI Summary
This work addresses the challenges of training multimodal reasoning reward models, which are often hindered by high noise levels and low efficiency in preference data, limiting their ability to align with human preferences. The authors propose Entropy-Guided Training (EGT), a novel approach that reveals, for the first time, a strong correlation between response entropy and both annotation noise and sample difficulty. Leveraging this insight, EGT introduces an unsupervised data filtering mechanism and an entropy-based curriculum learning strategy, neither of which requires additional supervision. Integrated within a multimodal large language model architecture, the method significantly outperforms state-of-the-art reward models across three benchmarks, achieving notable improvements in both accuracy and training efficiency.
Abstract
Multimodal reward models are crucial for aligning multimodal large language models with human preferences. Recent works have incorporated reasoning capabilities into these models, achieving promising results. However, training these models suffers from two critical challenges: (1) the inherent noise in preference datasets, which degrades model performance, and (2) the inefficiency of conventional training methods, which ignore the differences in sample difficulty. In this paper, we identify a strong correlation between response entropy and accuracy, indicating that entropy can serve as a reliable and unsupervised proxy for annotation noise and sample difficulty. Based on this insight, we propose a novel Entropy-Guided Training (EGT) approach for multimodal reasoning reward models, which combines two strategies: (1) entropy-guided data curation to mitigate the impact of unreliable samples, and (2) an entropy-guided training strategy that progressively introduces more complex examples. Extensive experiments across three benchmarks show that the EGT-trained model consistently outperforms state-of-the-art multimodal reward models.
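The two strategies described above, filtering unreliable samples and ordering the rest from easy to hard, can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it assumes response entropy is approximated by the mean negative log-probability of the sampled tokens, and the function names, sample fields, and the noise threshold are all invented for the example.

```python
def response_entropy(token_logprobs):
    # Cheap entropy proxy: average negative log-probability of the
    # sampled response tokens (higher = less confident response).
    return -sum(token_logprobs) / len(token_logprobs)

def entropy_guided_curation(samples, noise_threshold=2.0):
    # Step 1 (data curation): drop samples whose entropy exceeds the
    # threshold, treating them as likely annotation noise.
    # Step 2 (curriculum): sort the remaining samples from low to high
    # entropy, so training sees easy examples before hard ones.
    scored = [(response_entropy(s["token_logprobs"]), s) for s in samples]
    kept = [(h, s) for h, s in scored if h <= noise_threshold]
    kept.sort(key=lambda pair: pair[0])
    return [s for _, s in kept]

# Toy usage with made-up preference samples.
samples = [
    {"id": "a", "token_logprobs": [-0.1, -0.2, -0.1]},  # low entropy: easy
    {"id": "b", "token_logprobs": [-1.5, -1.2, -1.8]},  # mid entropy: harder
    {"id": "c", "token_logprobs": [-3.0, -2.5, -2.9]},  # high entropy: filtered
]
curriculum = entropy_guided_curation(samples)
print([s["id"] for s in curriculum])  # → ['a', 'b']
```

In a real pipeline the threshold would be chosen from the empirical entropy distribution rather than fixed, and the curriculum would be applied per training stage rather than as a single global sort.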