Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal reward models often suffer from vision-centric bias, opaque scalar outputs, and reliance on costly human annotations, hindering effective alignment of multimodal large language models. To address these limitations, this work proposes Omni-RRM, the first open-source, general-purpose reward model that requires no human labels: multimodal preferences are synthesized automatically and grounded in scoring rubrics. The approach builds a multimodal preference dataset, Omni-Preference, through a fully automated generation pipeline, then trains in two stages, supervised fine-tuning followed by GRPO reinforcement learning, with strong teacher models filtering preference pairs and supplying rubric-grounded rationales. Experiments show that Omni-RRM reaches 80.2% and 66.8% accuracy on video and audio benchmarks, respectively, substantially outperforms existing open-source reward models on image tasks with a 17.7% absolute gain over its base model in overall accuracy, and significantly improves downstream Best-of-N selection.
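
The synthesis loop the summary describes (contrast a stronger and a weaker model, then let a teacher confirm and annotate the ordering) can be pictured in a few lines. Below is a minimal Python sketch, assuming hypothetical generate and judge_with_rubric helpers; the paper's concrete prompts, models, and rubric contents are not specified here.

```python
# Hypothetical sketch of the automated preference-synthesis step: candidate
# pairs come from models of different capability, and a strong teacher model
# filters and annotates them against a modality-aware rubric. generate() and
# judge_with_rubric() are stand-in stubs, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response from the stronger model
    rejected: str    # response from the weaker model
    rationale: str   # teacher's rubric-grounded justification

def generate(model: str, prompt: str) -> str:
    """Stub: query an MLLM and return its response text."""
    raise NotImplementedError

def judge_with_rubric(teacher: str, prompt: str, a: str, b: str) -> dict:
    """Stub: teacher scores both responses against a rubric and returns
    {'winner': 'a' | 'b' | 'tie', 'rationale': str}."""
    raise NotImplementedError

def synthesize(prompts, strong="strong-mllm", weak="weak-mllm",
               teacher="teacher-mllm"):
    pairs = []
    for p in prompts:
        a, b = generate(strong, p), generate(weak, p)
        verdict = judge_with_rubric(teacher, p, a, b)
        # Filtering step: keep only pairs where the teacher confirms the
        # expected ordering; drop ties and reversals.
        if verdict["winner"] == "a":
            pairs.append(PreferencePair(p, a, b, verdict["rationale"]))
    return pairs
```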

📝 Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-N selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.
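
To make the "structured, multi-dimension preference judgments" and the Best-of-N use case concrete, here is a hedged Python sketch. The rubric dimensions, the 1-5 scale, and the score_with_rubric interface are illustrative assumptions, not the schema the released model actually emits.

```python
# Illustrative shape of a rubric-grounded, dimension-wise judgment, plus
# Best-of-N selection built on top of it. All names below are assumptions.
from typing import TypedDict

class DimensionJudgment(TypedDict):
    score: float        # e.g. 1-5 on this rubric dimension
    justification: str  # the dimension-wise rationale

def score_with_rubric(prompt: str, response: str) -> dict[str, DimensionJudgment]:
    """Stub: the reward model returns one judgment per rubric dimension, e.g.
    {'faithfulness': {...}, 'helpfulness': {...}, 'modality_grounding': {...}}."""
    raise NotImplementedError

def overall(judgments: dict[str, DimensionJudgment]) -> float:
    # Unweighted mean over dimensions; a real deployment might weight
    # dimensions differently per modality.
    return sum(d["score"] for d in judgments.values()) / len(judgments)

def best_of_n(prompt: str, candidates: list[str]) -> str:
    # Best-of-N selection: score every candidate response with the reward
    # model and keep the highest-scoring one.
    return max(candidates, key=lambda c: overall(score_with_rubric(prompt, c)))
```
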
Problem

Research questions and friction points this paper is trying to address.

reward modeling
multimodal alignment
preference synthesis
human annotation
structured judgment
Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric-grounded reward modeling
multimodal preference learning
automatic preference synthesis
structured reward model
GRPO (see the sketch after this list)
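
GRPO, the RL stage named above, replaces a learned value baseline with group statistics: rewards for a group of sampled outputs are normalized by the group's own mean and standard deviation. A minimal sketch follows; the binary reward (1 if a sampled judgment picks the preferred response, 0 otherwise) is an assumption about how pairwise accuracy could be rewarded.

```python
# Compact sketch of GRPO's group-relative advantage computation.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each reward against its group:
    advantage_i = (r_i - mean(group)) / std(group)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Example: a group of 4 sampled judgments where 3 picked the preferred
# response (reward 1) and 1 did not (reward 0).
print(group_relative_advantages([1.0, 1.0, 0.0, 1.0]))
```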
👥 Authors
Zicheng Kong
Beijing University of Posts and Telecommunications
Dehua Ma
Beijing University of Posts and Telecommunications
Zhenbo Xu
Beijing University of Posts and Telecommunications
Alven Yang
Beijing University of Posts and Telecommunications
Yiwei Ru
Beijing University of Posts and Telecommunications
Haoran Wang
Tsinghua University
Machine Learning · NLP · Agent · LLM Post-Training
Zixuan Zhou
Beijing University of Posts and Telecommunications
Fuqing Bie
Beijing University of Posts and Telecommunications
Liuyu Xiang
Beijing University of Posts and Telecommunications
Computer Vision · Reinforcement Learning · LLM Agent
Huijia Wu
Beijing University of Posts and Telecommunications
Jian Zhao
Zhongguancun Institute of Artificial Intelligence
Reinforcement Learning · Multi-Agent System
Zhaofeng He
Beijing University of Posts and Telecommunications