Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal reward models often suffer from vision-centric bias, opaque scalar outputs, and reliance on costly human annotations, hindering effective alignment of multimodal large language models. To address these limitations, this work proposes Omni-RRM, the first open-source, general-purpose reward model that requires no human labels: multimodal preferences are synthesized automatically and grounded in scoring rubrics. The approach builds a multimodal preference dataset, Omni-Preference, through a fully automated generation pipeline, then trains in two stages, supervised fine-tuning followed by GRPO reinforcement learning, with strong teacher models filtering preference pairs and supplying rubric-grounded rationales. Experiments show that Omni-RRM reaches 80.2% and 66.8% accuracy on video and audio benchmarks, respectively, substantially outperforms existing open-source reward models on image tasks with a 17.7% absolute gain over its base model in overall accuracy, and significantly improves downstream Best-of-N selection.
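
The synthesis loop the summary describes (contrast a stronger and a weaker model, then let a teacher confirm and annotate the ordering) can be pictured in a few lines. Below is a minimal Python sketch, assuming hypothetical generate and judge_with_rubric helpers; the paper's concrete prompts, models, and rubric contents are not specified here.

```python
# Hypothetical sketch of the automated preference-synthesis step: candidate
# pairs come from models of different capability, and a strong teacher model
# filters and annotates them against a modality-aware rubric. generate() and
# judge_with_rubric() are stand-in stubs, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response from the stronger model
    rejected: str    # response from the weaker model
    rationale: str   # teacher's rubric-grounded justification

def generate(model: str, prompt: str) -> str:
    """Stub: query an MLLM and return its response text."""
    raise NotImplementedError

def judge_with_rubric(teacher: str, prompt: str, a: str, b: str) -> dict:
    """Stub: teacher scores both responses against a rubric and returns
    {'winner': 'a' | 'b' | 'tie', 'rationale': str}."""
    raise NotImplementedError

def synthesize(prompts, strong="strong-mllm", weak="weak-mllm",
               teacher="teacher-mllm"):
    pairs = []
    for p in prompts:
        a, b = generate(strong, p), generate(weak, p)
        verdict = judge_with_rubric(teacher, p, a, b)
        # Filtering step: keep only pairs where the teacher confirms the
        # expected ordering; drop ties and reversals.
        if verdict["winner"] == "a":
            pairs.append(PreferencePair(p, a, b, verdict["rationale"]))
    return pairs
```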

📝 Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-N selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.
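
To make the "structured, multi-dimension preference judgments" and the Best-of-N use case concrete, here is a hedged Python sketch. The rubric dimensions, the 1-5 scale, and the score_with_rubric interface are illustrative assumptions, not the schema the released model actually emits.

```python
# Illustrative shape of a rubric-grounded, dimension-wise judgment, plus
# Best-of-N selection built on top of it. All names below are assumptions.
from typing import TypedDict

class DimensionJudgment(TypedDict):
    score: float        # e.g. 1-5 on this rubric dimension
    justification: str  # the dimension-wise rationale

def score_with_rubric(prompt: str, response: str) -> dict[str, DimensionJudgment]:
    """Stub: the reward model returns one judgment per rubric dimension, e.g.
    {'faithfulness': {...}, 'helpfulness': {...}, 'modality_grounding': {...}}."""
    raise NotImplementedError

def overall(judgments: dict[str, DimensionJudgment]) -> float:
    # Unweighted mean over dimensions; a real deployment might weight
    # dimensions differently per modality.
    return sum(d["score"] for d in judgments.values()) / len(judgments)

def best_of_n(prompt: str, candidates: list[str]) -> str:
    # Best-of-N selection: score every candidate response with the reward
    # model and keep the highest-scoring one.
    return max(candidates, key=lambda c: overall(score_with_rubric(prompt, c)))
```
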
Problem

Research questions and friction points this paper is trying to address.

reward modeling
multimodal alignment
preference synthesis
human annotation
structured judgment
Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric-grounded reward modeling
multimodal preference learning
automatic preference synthesis
structured reward model
GRPO (see the sketch after this list)
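
GRPO, the RL stage named above, replaces a learned value baseline with group statistics: rewards for a group of sampled outputs are normalized by the group's own mean and standard deviation. A minimal sketch follows; the binary reward (1 if a sampled judgment picks the preferred response, 0 otherwise) is an assumption about how pairwise accuracy could be rewarded.

```python
# Compact sketch of GRPO's group-relative advantage computation.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each reward against its group:
    advantage_i = (r_i - mean(group)) / std(group)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Example: a group of 4 sampled judgments where 3 picked the preferred
# response (reward 1) and 1 did not (reward 0).
print(group_relative_advantages([1.0, 1.0, 0.0, 1.0]))
```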
👥 Authors
Zicheng Kong
Beijing University of Posts and Telecommunications
Dehua Ma
Beijing University of Posts and Telecommunications
Zhenbo Xu
Beijing University of Posts and Telecommunications
Alven Yang
Beijing University of Posts and Telecommunications
Yiwei Ru
Beijing University of Posts and Telecommunications
Haoran Wang
Tsinghua University
Machine Learning · NLP · Agent · LLM Post-Training
Zixuan Zhou
Beijing University of Posts and Telecommunications
Fuqing Bie
Beijing University of Posts and Telecommunications
Liuyu Xiang
Beijing University of Posts and Telecommunications
Computer Vision · Reinforcement Learning · LLM Agent
Huijia Wu
Beijing University of Posts and Telecommunications
Jian Zhao
Zhongguancun Institute of Artificial Intelligence
Reinforcement Learning · Multi-Agent System
Zhaofeng He
Beijing University of Posts and Telecommunications