STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

📅 2026-05-03

🤖 AI Summary
This work addresses the fragility of system rankings in human evaluation of AI systems under majority-vote aggregation, which ignores annotator reliability and item-level ambiguity. The authors propose STABLEVAL, a framework that treats ranking stability as a primary evaluation objective. STABLEVAL uses Bayesian modeling to jointly capture latent item correctness, annotator-specific confusion patterns, and item ambiguity, yielding uncertainty-aware posterior expected item credit and calibrated system-level scores. It targets stable evaluation rather than hard-label recovery, which separates it from label-denoising approaches such as Dawid-Skene. On synthetic experiments and real human-annotated benchmarks, STABLEVAL reduces score error and ranking instability relative to majority vote, particularly under annotator heterogeneity and adversarial noise.
📝 Abstract
Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.
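To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares hard majority-vote credit with a posterior expected credit for a single item with binary correct/incorrect votes. It assumes each annotator's sensitivity and specificity are already known and that votes are conditionally independent given the item's true state; a full model would infer these quantities from data. All function and parameter names are hypothetical.

```python
import math


def majority_vote_credit(votes):
    """Hard credit: 1.0 if strictly more than half the votes say 'correct'."""
    return 1.0 if 2 * sum(votes) > len(votes) else 0.0


def posterior_expected_credit(votes, sensitivity, specificity, prior=0.5):
    """Return P(item is correct | votes) under a simple two-class confusion model.

    votes[j]       -- 1 if annotator j marked the item correct, else 0
    sensitivity[j] -- assumed P(vote = 1 | item is correct) for annotator j
    specificity[j] -- assumed P(vote = 0 | item is incorrect) for annotator j
    prior          -- assumed prior probability that the item is correct
    """
    log_correct = math.log(prior)
    log_incorrect = math.log(1.0 - prior)
    for vote, sens, spec in zip(votes, sensitivity, specificity):
        log_correct += math.log(sens if vote == 1 else 1.0 - sens)
        log_incorrect += math.log(1.0 - spec if vote == 1 else spec)
    # Normalize in log space to avoid underflow with many annotators.
    shift = max(log_correct, log_incorrect)
    weight_correct = math.exp(log_correct - shift)
    weight_incorrect = math.exp(log_incorrect - shift)
    return weight_correct / (weight_correct + weight_incorrect)


def system_score(votes_per_item, credit_fn):
    """Average per-item credit over all items answered by one system."""
    return sum(credit_fn(votes) for votes in votes_per_item) / len(votes_per_item)


# Illustrative values only: one reliable annotator and two near-random ones.
example_votes = [1, 0, 0]
example_sensitivity = [0.95, 0.55, 0.55]
example_specificity = [0.95, 0.55, 0.55]

hard_credit = majority_vote_credit(example_votes)
soft_credit = posterior_expected_credit(
    example_votes, example_sensitivity, example_specificity
)
```

With these illustrative reliabilities, majority vote assigns the item zero credit because two of three annotators voted "incorrect", while the posterior credit is roughly 0.93 because the single reliable annotator outweighs the two near-random ones. Averaging such soft credits with `system_score` gives a system-level score that shifts less when a noisy annotator is added or removed, which is the kind of stability the abstract refers to.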
Problem

Research questions and friction points this paper is trying to address.

annotator disagreement
evaluation stability
AI system evaluation
ranking instability
human evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

disagreement-aware evaluation
ranking stability
annotator modeling
uncertainty-aware scoring
STABLEVAL
Akash Bonagiri
Department of Computer Science, University of California, Davis, Davis, CA, USA
Gerard Janno Anderias
Department of Computer Science, University of California, Davis, Davis, CA, USA
Saee Patil
Department of Computer Science, University of California, Davis, Davis, CA, USA
Angelina Lai
Department of Computer Science, University of California, Davis, Davis, CA, USA
Devang Borkar
Department of Computer Science, University of California, Davis, Davis, CA, USA
Gezheng Kang
Department of Computer Science, University of California, Davis, Davis, CA, USA
Ishant Gandhi
Department of Computer Science, University of California, Davis, Davis, CA, USA
Setareh Rafatirad
Associate Professor, Computer Science Department, University of California Davis
Mobile Security, Edge Device Trust, Applied Machine Learning, Cybersecurity, HW/SW Co-Design
Houman Homayoun
University of California Davis
Applied Machine Learning, System Security, Hardware Security, Computer Architecture, mHealth