🤖 AI Summary
Reliable automated evaluation of software engineering artifacts (code generation, translation, and summarization) remains an open problem: human evaluation is costly, subjective, and hard to scale, while existing LLM-based evaluators lack the sensitivity to detect fine-grained quality differences.
Method: REFINE is a framework that couples controllable, fine-grained synthesis of quality-degraded artifacts with ranking-alignment testing of candidate evaluators. It supports progressive tuning, from coarse-grained filtering to stress-testing subtle quality distinctions, backed by hierarchical dataset construction and a quantitative ranking-consistency metric.
Contribution/Results: Evaluated on industrial-scale COBOL code data, REFINE automatically identifies and validates high-quality evaluator configurations. Experiments show it lifts alignment between LLM evaluators and human annotations from below 0.7 to above 0.9 (Kendall's τ) on some coding tasks. The framework has been deployed to support model release decisions in production training teams.
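For intuition, the alignment score above is Kendall's τ between a candidate evaluator's ranking and the expected ordering: τ counts concordant minus discordant pairs, normalized by the total number of pairs, so τ = 1 means perfect agreement. The sketch below shows how such a ranking-consistency check can be computed with `scipy.stats.kendalltau`; the function name and sign convention are illustrative assumptions, not REFINE's actual API.

```python
from scipy.stats import kendalltau

def ranking_alignment(evaluator_scores, expected_ranks):
    """Kendall's tau between an evaluator's scores and the expected ordering.

    evaluator_scores: quality scores a candidate LLM evaluator assigned
                      to each variant of an artifact.
    expected_ranks:   ground-truth ordering (0 = original artifact,
                      higher = more heavily degraded).
    Returns tau in [-1, 1]; values near 1 mean the evaluator ranks
    less-degraded artifacts higher, i.e., strong alignment.
    """
    # Negate scores so the best artifact (rank 0) should carry the
    # highest score; tau is invariant to monotone transforms, so this
    # only flips the sign convention.
    tau, _p_value = kendalltau([-s for s in evaluator_scores], expected_ranks)
    return tau

# Toy example: 5 variants of one artifact, original (rank 0) through
# most degraded (rank 4). A well-aligned evaluator scores them 9..3.
print(ranking_alignment([9.0, 8.5, 7.0, 5.0, 3.0], [0, 1, 2, 3, 4]))  # 1.0
```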
📝 Abstract
Automation in software engineering increasingly relies on large language models (LLMs) to generate, review, and assess code artifacts. However, establishing LLMs as reliable evaluators remains an open challenge: human evaluations are costly, subjective, and non-scalable, while existing automated methods fail to discern fine-grained variations in artifact quality.
We introduce REFINE (Ranking Evaluators for FIne-grained Nuanced Evaluation), an automated framework for benchmarking LLM-based evaluators across software engineering tasks. REFINE comprises two modules: the Hierarchy Dataset Builder applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality, and the Evaluator Tester quantifies each candidate evaluator configuration by measuring how closely its rankings align with the expected ordering.
A key feature of REFINE is controllability: users can tune the granularity of degradation to progressively refine evaluator configurations, from coarse filtering to stress-testing on subtle quality gaps.
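As a hedged illustration of that controllability, the sketch below builds a degradation hierarchy with a tunable `severity` knob. The operator (`drop_comments`) and the knob are hypothetical stand-ins; the paper's actual generation techniques for COBOL artifacts are not reproduced here.

```python
import random

random.seed(0)  # reproducible toy example

# Hypothetical degradation operator: strip a fraction of COBOL comment
# lines (simplified '*'-prefix convention). Only meant to show the
# shape of a controllable degradation step, not REFINE's real operators.
def drop_comments(code: str, severity: float) -> str:
    kept = [line for line in code.splitlines()
            if not line.lstrip().startswith("*") or random.random() > severity]
    return "\n".join(kept)

def build_hierarchy(artifact: str, levels: int, severity: float) -> list[str]:
    """Return [original, v1, ..., v_levels], each level cumulatively
    more degraded than the last. A small `severity` yields subtle
    quality gaps for stress-testing evaluators; a large `severity`
    yields coarse gaps for quickly filtering weak configurations."""
    variants, current = [artifact], artifact
    for _ in range(levels):
        current = drop_comments(current, severity)
        variants.append(current)
    return variants

cobol_source = """\
* COMPUTE GROSS PAY FOR ONE EMPLOYEE
MOVE HOURS-WORKED TO WS-HOURS
* OVERTIME IS PAID AT 1.5X
MULTIPLY WS-HOURS BY HOURLY-RATE GIVING GROSS-PAY
"""

coarse = build_hierarchy(cobol_source, levels=3, severity=0.9)  # big gaps
subtle = build_hierarchy(cobol_source, levels=3, severity=0.2)  # small gaps
```

Each resulting hierarchy then feeds the Evaluator Tester, which checks that a candidate evaluator's scores reproduce the known ordering, as in the Kendall's τ sketch above.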
While the methodology is general, we focus on coding tasks, reflecting the practical demands of our production setting. REFINE was integrated into IBM's internal development workflows and applied to code generation, translation, and summarization for COBOL, an enterprise-critical programming language, using industrial data. It was used to identify LLM-as-a-Judge configurations that lifted alignment scores from below $0.7$ to above $0.9$ on some coding tasks. These nuance-sensitive evaluators are now actively used by model training teams to support model release decisions.