IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

📅 2026-03-05

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing meta-evaluation benchmarks for instruction-following suffer from insufficient data coverage and oversimplified pairwise evaluation paradigms, so they do not accurately reflect how judge models perform in real-world alignment scenarios. To address this, this work proposes IF-RewardBench, a comprehensive benchmark covering diverse instruction and constraint types, which introduces, for the first time, a listwise ranking evaluation paradigm based on multi-response preference graphs. This paradigm better matches practical alignment requirements, where a judge must rank several responses rather than compare a single pair. Experimental results reveal substantial deficiencies in current judge models' ability to evaluate instruction-following, and show that IF-RewardBench correlates more strongly with downstream task performance than existing benchmarks.

📝 Abstract

Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
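
To make the preference-graph idea concrete, here is a minimal Python sketch of how pairwise preferences over several responses could be collected and how a judge's ranking could be scored against them. It is an illustration under stated assumptions, not the benchmark's actual implementation: the function names, the integer quality labels, and the edge-agreement score are all hypothetical, and the abstract does not specify which listwise metric IF-RewardBench uses.

```python
from itertools import combinations


def build_preference_graph(quality):
    """Return directed edges (better, worse) implied by quality labels.

    `quality` maps a response id to a hypothetical instruction-following
    score (for example, the number of satisfied constraints).
    Tied responses produce no edge.
    """
    edges = set()
    for a, b in combinations(quality, 2):
        if quality[a] > quality[b]:
            edges.add((a, b))
        elif quality[b] > quality[a]:
            edges.add((b, a))
    return edges


def ranking_agreement(edges, judge_ranking):
    """Fraction of preference edges that the judge's ranking orders correctly.

    `judge_ranking` lists response ids from best to worst.
    This scoring rule is an assumption made for illustration only.
    """
    if not edges:
        return 1.0
    position = {resp: i for i, resp in enumerate(judge_ranking)}
    correct = sum(1 for better, worse in edges
                  if position[better] < position[worse])
    return correct / len(edges)


# Hypothetical labels for four responses to one instruction.
quality = {"r1": 3, "r2": 1, "r3": 2, "r4": 2}
graph = build_preference_graph(quality)

# A judge that wrongly places r2 above r3 and r4.
judge_ranking = ["r1", "r2", "r3", "r4"]
score = ranking_agreement(graph, judge_ranking)
print(f"{len(graph)} edges, agreement = {score:.2f}")
```

Unlike a single pairwise comparison, this score falls with every misordered pair in the list, which is the sense in which a listwise paradigm probes a judge's ability to rank multiple responses.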

Problem

Research questions and friction points this paper is trying to address.

instruction-following

judge models

meta-evaluation benchmark

preference evaluation

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-following

judge models

preference graph