🤖 AI Summary
Existing multimodal reward models often suffer from visual-centric bias, opaque scalar outputs, and reliance on costly human annotations, hindering effective alignment of multimodal large language models. To address these limitations, this work proposes Omni-RRM—the first open-source, general-purpose rubric-grounded reward model that requires no human labels and automatically generates multimodal preferences based on scoring rules. The approach constructs a multimodal preference dataset, Omni-Preference, through an automated generation pipeline, followed by a two-stage training strategy of supervised fine-tuning and GRPO reinforcement learning, with strong teacher models filtering preferences and aligning them to the scoring rules. Experiments demonstrate that Omni-RRM achieves 80.2% and 66.8% accuracy on video and audio benchmarks, respectively, outperforms its base model by 17.7 percentage points in overall accuracy while surpassing existing open-source RMs on image tasks, and significantly improves downstream Best-of-N selection performance.
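The automated pipeline described above can be illustrated with a minimal sketch: candidate pairs come from contrasting a stronger and a weaker model, and a teacher judge keeps only pairs it confirms, attaching a rubric-style rationale. All function names, prompts, and the toy judging heuristic here are placeholders, not the paper's actual models or prompts.

```python
# Minimal sketch of label-free preference-pair construction.
# strong_model / weak_model / teacher_judge are hypothetical stand-ins.

def strong_model(prompt: str) -> str:
    return f"detailed answer to: {prompt}"

def weak_model(prompt: str) -> str:
    return f"short answer to: {prompt}"

def teacher_judge(prompt: str, chosen: str, rejected: str):
    """Return (keep, rationale): retain only pairs where the teacher
    agrees with the capability-based preference, with a rubric-style
    justification. Length is a toy proxy for response quality."""
    agrees = len(chosen) > len(rejected)
    rationale = "completeness: chosen covers the query more fully" if agrees else ""
    return agrees, rationale

def build_preference_pairs(prompts):
    """Contrast models of different capability for candidate pairs,
    then keep only teacher-confirmed pairs with a rationale."""
    dataset = []
    for p in prompts:
        chosen, rejected = strong_model(p), weak_model(p)
        keep, rationale = teacher_judge(p, chosen, rejected)
        if keep:
            dataset.append({"prompt": p, "chosen": chosen,
                            "rejected": rejected, "rationale": rationale})
    return dataset

pairs = build_preference_pairs(["Describe the image.", "Summarize the clip."])
```

In the actual pipeline the teacher is a strong multimodal model and the rationale is grounded in modality-aware scoring rules; this sketch only shows the contrast-then-filter shape of the data construction.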
📝 Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce **Omni-RRM**, the first open-source rubric-grounded reward model that produces structured, multi-dimensional preference judgments with dimension-wise justifications across **text, image, video, and audio**. At the core of our approach is **Omni-Preference**, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to *reconcile and filter* preferences while providing a modality-aware *rubric-grounded rationale* for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-*N* selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.
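The Best-of-*N* selection mentioned in the abstract is simple to state in code: sample *N* candidate responses, score each with the reward model, and keep the top-scoring one. The sketch below uses a toy length-based scorer as a placeholder for Omni-RRM; the selection logic itself is the standard Best-of-*N* procedure, not a detail from the paper.

```python
# Best-of-N selection: pick the candidate the reward model scores highest.

def best_of_n(prompt, candidates, reward_model):
    """Score each candidate with the reward model and return the
    highest-scoring response."""
    scored = [(reward_model(prompt, c), c) for c in candidates]
    return max(scored)[1]

# Toy reward model that prefers longer responses (placeholder for a
# real learned RM such as Omni-RRM).
def toy_rm(prompt, response):
    return len(response)

best = best_of_n(
    "Describe the scene.",
    ["ok", "a fuller, better answer", "mid answer"],
    toy_rm,
)
# best == "a fuller, better answer"
```

Because the generator is untouched, Best-of-*N* gains come entirely from the reward model's ability to rank candidates, which is why it is a common downstream test of RM quality.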