🤖 AI Summary
Existing reward models suffer from two key limitations: modality imbalance—support restricted largely to text and image modalities—and preference rigidity—reliance on fixed binary preference pairs. To address these, the authors propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form, user-specific preferences across text, image, video, audio, and 3D data. The framework comprises three parts: Omni-RewardBench, the first omni-modal reward-model benchmark with free-form preferences, covering nine tasks across the five modalities; Omni-RewardData, a large-scale multimodal preference dataset with 248K general preference pairs and 69K instruction-tuning pairs; and Omni-RewardModel, which includes both discriminative and generative reward models. Experiments show that Omni-RewardModel achieves strong performance on Omni-RewardBench as well as on established reward modeling benchmarks, yielding consistent cross-modal reward prediction and closer alignment with nuanced human preferences.
📝 Abstract
Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs focus on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address these challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
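To make the "fixed binary preference pairs" setting concrete: discriminative RMs of this kind are typically trained with a Bradley–Terry style pairwise loss, where the model score of the chosen response should exceed that of the rejected one. The sketch below is a generic, minimal illustration of that standard loss, not the paper's actual implementation; the function name and NumPy formulation are our own.

```python
import numpy as np

def pairwise_preference_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood over binary preference pairs.

    Generic sketch of the standard discriminative-RM objective the abstract
    contrasts with free-form preferences (not the paper's implementation).
    r_chosen / r_rejected are scalar reward scores per pair.
    """
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(margin), computed stably via logaddexp(0, -margin)
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

A model that scores the chosen response well above the rejected one incurs a small loss; reversing the ranking makes the loss large, which is what drives the RM to reproduce the annotated binary preference.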