CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM-based judge models suffer from narrow domain coverage and poor cross-domain robustness, limiting their applicability in general-purpose evaluation. To address this, we propose CompassJudger-2: (1) a task-driven, multi-domain training paradigm for judges, incorporating verifiable reward supervision to guide judgment reasoning; (2) a margin-augmented policy gradient loss jointly optimized with rejection sampling to foster intrinsic critical reasoning; and (3) JudgerBenchV2, a comprehensive benchmark evaluating both judgment accuracy and ranking consistency across diverse domains. Experiments show that the 7B-parameter CompassJudger-2 significantly outperforms peer judge and reward models on mainstream benchmarks. Notably, its judgment accuracy matches that of ultra-large models such as DeepSeek-V3 and Qwen3-235B-A22B, achieving, for the first time, simultaneous cross-domain consistency and robustness in a compact judge model.
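
The summary does not spell out the margin-augmented policy gradient loss, so the following is a minimal sketch under stated assumptions: rejection sampling is taken to yield judgments that pass or fail a verifiable-reward check, and their sequence log-probabilities (`logp_pos`, `logp_neg`, both hypothetical names) are contrasted with a hinge-style margin. This is an illustration of the general technique, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def margin_policy_gradient_loss(logp_pos: torch.Tensor,
                                logp_neg: torch.Tensor,
                                margin: float = 0.1) -> torch.Tensor:
    """Sketch of a margin-augmented policy gradient loss (assumed form,
    not the paper's exact objective). `logp_pos` / `logp_neg` hold sequence
    log-probabilities of judgments that passed / failed the
    verifiable-reward check during rejection sampling."""
    gap = logp_pos - logp_neg
    # Policy-gradient-style contrastive term: raise the likelihood of
    # verified judgments, lower that of rejected ones.
    pg_term = -gap.mean()
    # Hinge margin: active only while verified judgments are not at least
    # `margin` more likely (in log space) than rejected ones.
    margin_term = F.relu(margin - gap).mean()
    return pg_term + margin_term

# Toy usage with stand-in log-probabilities.
logp_pos = torch.randn(4, requires_grad=True)
logp_neg = torch.randn(4, requires_grad=True)
loss = margin_policy_gradient_loss(logp_pos, logp_neg)
loss.backward()  # gradients flow to whatever produced the log-probs
```

Under this reading, the hinge term fires only when verified judgments are insufficiently separated from rejected ones, which matches the stated goal of fostering intrinsic critical reasoning rather than merely imitating labels.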

📝 Abstract
Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
Problem

Research questions and friction points this paper is trying to address.

Overcoming narrow specialization in LLM judge models
Enhancing robustness via verifiable reward supervision
Establishing standardized evaluation benchmarks for judge models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-driven multi-domain data curation strategy
Verifiable rewards for judgment supervision (see the sampling sketch after this list)
Margin policy gradient loss enhancement
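
As referenced above, here is a minimal sketch of verifiable-reward rejection sampling. All names (`verifiable_reward`, `rejection_sample`, `sample_fn`, `gold_verdict`) are hypothetical illustrations, not the paper's API; the idea, per the abstract, is to keep only sampled judgments that a verifiable reward confirms and use the survivors to supervise the judge.

```python
from typing import Callable, List

def verifiable_reward(judgment: str, gold_verdict: str) -> float:
    """Hypothetical checker: 1.0 if the judgment's final verdict line
    matches a verifiable ground-truth label (e.g. "A" or "B"), else 0.0."""
    return 1.0 if judgment.strip().splitlines()[-1] == gold_verdict else 0.0

def rejection_sample(sample_fn: Callable[[], str],
                     gold_verdict: str,
                     n: int = 8) -> List[str]:
    """Draw n candidate judgments from the model and keep only those with
    positive verifiable reward; survivors become supervision data."""
    return [c for c in (sample_fn() for _ in range(n))
            if verifiable_reward(c, gold_verdict) > 0]

# Toy usage with a stand-in sampler instead of a real judge model.
kept = rejection_sample(lambda: "Response A is more accurate.\nA", "A")
print(len(kept), "of 8 samples kept")
```

In this reading, the filter replaces human preference labels with an automatically checkable signal, which is what lets the curation scale across domains.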