AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code evaluation metrics and benchmarks suffer from three key limitations: (1) rule-based metrics rely solely on syntactic or surface-level similarity, neglecting functional correctness and code quality; (2) coarse-grained benchmark labels (e.g., binary pass/fail) obscure subtle errors; and (3) fine-grained human annotations are subjective, ambiguous, and prone to distributional imbalance due to uncontrolled synthetic data generation. To address these issues, we propose AXIOM, an LLM-as-a-judge benchmark designed specifically for code assessment. Our approach integrates rule-guided perturbation synthesis with multi-source quality calibration: it generates samples spanning the full quality spectrum via rule-constrained perturbations that controllably alter functionality and code quality, and it employs multi-LLM collaborative scoring, consistency-weighted aggregation, and high-quality reverse editing to enable controllable score generation and objective ground-truth calibration. Experiments demonstrate that our benchmark significantly improves the reliability and discriminative power of LLM judges.
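The summary names multi-LLM collaborative scoring and consistency-weighted aggregation without spelling out the scheme. The snippet below is a minimal sketch of one plausible reading, in which a judge whose score deviates from its peers is down-weighted; the function name, the weighting rule, and the 0-10 scale are assumptions for illustration, not the paper's implementation.

```python
from statistics import mean

def consistency_weighted_score(judge_scores: list[float]) -> float:
    """Aggregate per-program scores from several LLM judges.

    Assumed scheme: each judge's weight shrinks with its deviation from
    the mean of the remaining judges, so outlier judgments contribute less.
    """
    if len(judge_scores) == 1:
        return judge_scores[0]
    weights = []
    for i, score in enumerate(judge_scores):
        peers = judge_scores[:i] + judge_scores[i + 1:]
        deviation = abs(score - mean(peers))
        weights.append(1.0 / (1.0 + deviation))  # smaller deviation -> larger weight
    return sum(w * s for w, s in zip(weights, judge_scores)) / sum(weights)

# Three judges rate the same program on a 0-10 scale; the outlier (2.0) is down-weighted.
print(round(consistency_weighted_score([7.0, 6.5, 2.0]), 2))
```

For the example above, an unweighted mean would give about 5.17, while this consistency-weighted score lands near 5.66, so the outlier judge's influence on the aggregate is reduced.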

📝 Abstract
Large language models (LLMs) have been increasingly deployed in real-world software engineering, fostering the development of code evaluation metrics to study the quality of LLM-generated code. Conventional rule-based metrics merely score programs based on their surface-level similarities with reference programs instead of analyzing functionality and code quality in depth. To address this limitation, researchers have developed LLM-as-a-judge metrics, prompting LLMs to evaluate and score code, and curated various code evaluation benchmarks to validate their effectiveness. However, these benchmarks suffer from critical limitations, hindering reliable assessments of evaluation capability: Some feature coarse-grained binary labels, which reduce rich code behavior to a single bit of information, obscuring subtle errors. Others propose fine-grained but subjective, vaguely-defined evaluation criteria, introducing unreliability in manually-annotated scores, which serve as the ground truth these benchmarks rely on. Furthermore, they often use uncontrolled data synthesis methods, leading to unbalanced score distributions that poorly represent real-world code generation scenarios. To curate a diverse benchmark with programs of well-balanced distributions across various quality levels and streamline the manual annotation procedure, we propose AXIOM, a novel perturbation-based framework for synthesizing code evaluation benchmarks at scale. It reframes program scores as the refinement effort needed for deployment and consists of two stages: (1) Rule-guided perturbation, which prompts LLMs to apply sequences of predefined perturbation rules to existing high-quality programs to modify their functionality and code quality, enabling us to precisely control each program's target score to achieve balanced score distributions. (2) Multisource quality calibration, which first selects a subset of...
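The abstract's first stage, rule-guided perturbation, controls each synthesized program's target score by choosing which predefined rules to apply to a high-quality seed program. The sketch below is a rough illustration only: it applies hand-written string-level rules greedily until an estimated score approaches a sampled target. The rule names, penalties, and 0-10 scale are invented here, and the paper's rules are applied by prompting LLMs rather than by string edits.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class PerturbationRule:
    name: str
    penalty: float                # assumed quality drop on a 0-10 scale
    apply: Callable[[str], str]   # source-to-source transform

# Illustrative rules only; the paper applies its perturbation rules via LLM prompting.
RULES = [
    PerturbationRule("off_by_one_bound", 3.0,
                     lambda src: src.replace("range(n)", "range(n - 1)")),
    PerturbationRule("drop_input_check", 2.0,
                     lambda src: src.replace("    if not items:\n        return 0\n", "")),
    PerturbationRule("opaque_naming", 1.0,
                     lambda src: src.replace("total", "t").replace("items", "xs")),
]

def perturb_to_target(seed_src: str, target: float, max_score: float = 10.0):
    """Greedily apply rules, moving an estimated score toward the target."""
    src, score, applied = seed_src, max_score, []
    for rule in sorted(RULES, key=lambda r: -r.penalty):
        if score - rule.penalty >= target - 0.5:
            src, score = rule.apply(src), score - rule.penalty
            applied.append(rule.name)
    return src, score, applied

SEED = (
    "def total_length(items, n):\n"
    "    if not items:\n"
    "        return 0\n"
    "    total = 0\n"
    "    for i in range(n):\n"
    "        total += len(items[i])\n"
    "    return total\n"
)

# Sample one target per quality bin so scores stay balanced across the benchmark.
for target in [random.uniform(lo, lo + 2.5) for lo in (0.0, 2.5, 5.0, 7.5)]:
    _, estimated, applied = perturb_to_target(SEED, target)
    print(f"target={target:4.1f}  estimated={estimated:4.1f}  rules={applied}")
```

In this framing, a lower target score simply means the synthesized program would need more refinement effort before deployment, matching how the abstract defines the score.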
Problem

Research questions and friction points this paper is trying to address.

Develops a benchmark to evaluate LLM-as-a-judge metrics for code assessment (a sketch of such an evaluation follows this list).
Addresses limitations of existing benchmarks with coarse or subjective scoring.
Uses rule-based perturbation to create balanced, realistic code quality distributions.
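Once ground-truth scores are calibrated, an LLM judge can be assessed by how well its scores track them, and rank correlation is one standard choice for that. The data, the judge's scores, and the use of Spearman's rho below are assumptions for illustration, not the paper's reported protocol.

```python
from scipy.stats import spearmanr

# Hypothetical benchmark slice: calibrated ground-truth scores vs. one LLM judge's scores.
ground_truth = [9.0, 7.5, 6.0, 4.0, 2.5, 1.0]
judge_scores = [8.5, 8.8, 5.5, 4.5, 3.0, 2.0]  # judge swaps the top two programs

rho, p_value = spearmanr(ground_truth, judge_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A reliable judge preserves the quality ranking (rho close to 1.0); binary
# pass/fail labels could not expose this kind of fine-grained ranking error.
```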
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rule-guided perturbation for balanced score distributions
Multisource quality calibration for reliable annotations
Perturbation-based framework synthesizing code evaluation benchmarks
Ruiqi Wang
Harbin Institute of Technology, Shenzhen, China
Xinchen Wang
Harbin Institute of Technology
AI4SE, Code Intelligence
Cuiyun Gao
Harbin Institute of Technology, Shenzhen, China
Chun Yong Chong
Monash University
Software Engineering
Xin Xia
Zhejiang University, China
Qing Liao
Harbin Institute of Technology, Shenzhen, China