DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing open-source multimodal preference datasets commonly suffer from coarse-grained preference strength, textual style bias, and unreliable preference signals, and there is no effective, scalable method for curating them. To address these limitations, this work proposes the DT2IT-MRM framework, which combines a debiased preference construction pipeline, a reformulation of text-to-image (T2I) preference data, and an iterative training framework that curates existing multimodal preference datasets, with the goal of improving both preference data quality and reward model performance. The proposed approach achieves state-of-the-art overall results across three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.

📝 Abstract

Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. In addition, existing open-source multimodal preference datasets suffer from substantial noise, yet effective and scalable curation methods to improve their quality are lacking. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.

Problem

Research questions and friction points this paper is trying to address.

multimodal reward modeling

preference dataset

textual style bias

preference signal reliability

data noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

Debiased Preference Construction

Text-to-Image Preference Reformulation

Iterative Training Framework