🤖 AI Summary
Reward modeling faces the challenges of expensive preference-data collection, poor interpretability, and limited generalization. This paper proposes Auto-Rubric, a training-free, general-purpose framework that automatically extracts hierarchical "Theme-Tips" rubrics. The method first infers high-quality, query-specific rubrics through a verification-guided Propose-Evaluate-Revise pipeline, then compresses them into a compact, non-redundant core set by maximizing an information-theoretic coding rate, combining hierarchical clustering with semantic redundancy reduction. Using only 70 preference pairs (1.5% of the original dataset), Auto-Rubric enables small models such as Qwen3-8B to outperform fully trained, domain-specific reward models. It substantially improves data efficiency, cross-task generalization, and interpretability in aligning with human preferences, enabling transparent, principle-based reward inference without any parameter optimization.
📝 Abstract
Reward models are essential for aligning Large Language Models (LLMs) with human values, yet their development is hampered by costly preference datasets and poor interpretability. While recent rubric-based approaches offer transparency, they often lack systematic quality control and optimization, creating a trade-off between scalability and reliability. We address these limitations with a novel, training-free framework built on a key assumption: *evaluation rubrics underlying human preferences exhibit significant generalization ability across diverse queries*, a property that enables remarkable data efficiency. Our two-stage approach first infers high-quality, query-specific rubrics using a validation-guided **Propose-Evaluate-Revise** pipeline. Second, it generalizes these granular rubrics into a compact, non-redundant core set by maximizing an **information-theoretic coding rate**. The final output is an interpretable, hierarchical "Theme-Tips" rubric set. Extensive experiments demonstrate the framework's exceptional data efficiency and performance. Critically, using just 70 preference pairs (1.5% of the source data), our method empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts. This work pioneers a scalable, interpretable, and data-efficient path for reward modeling.
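To make the coding-rate idea concrete, here is a minimal sketch of how a compact, non-redundant subset could be selected from rubric embeddings by maximizing the standard rate-distortion coding rate, R(Z) = ½ log det(I + d/(nε²)·ZZᵀ). This is an illustration only: the function names, the greedy selection strategy, and the distortion parameter `eps` are assumptions for exposition, not the paper's exact procedure.

```python
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Coding rate R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z Z^T)
    for n row vectors of dimension d; larger means more 'spread out'."""
    n, d = Z.shape
    gram = Z @ Z.T  # n x n Gram matrix of the candidate subset
    _, logdet = np.linalg.slogdet(np.eye(n) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet

def greedy_select(embeddings: np.ndarray, k: int, eps: float = 0.5) -> list:
    """Greedily pick k rubric embeddings that maximize the coding rate,
    i.e. a maximally diverse (semantically non-redundant) core set."""
    chosen, remaining = [], list(range(len(embeddings)))
    for _ in range(k):
        best, best_rate = None, -np.inf
        for i in remaining:
            rate = coding_rate(embeddings[chosen + [i]], eps)
            if rate > best_rate:
                best, best_rate = i, rate
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because near-duplicate embeddings add almost nothing to log det of the Gram matrix, the greedy pass naturally skips redundant rubrics and keeps ones that cover distinct evaluation criteria.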