🤖 AI Summary
This work addresses the fragility of system rankings in human evaluation of AI systems under majority-vote aggregation, which ignores annotator reliability and item ambiguity. The authors propose STABLEVAL, a framework that treats ranking stability as a primary evaluation objective. STABLEVAL uses Bayesian modeling to jointly capture latent item correctness, annotator-specific confusion patterns, and item-level ambiguity, yielding uncertainty-aware posterior expected item credit and calibrated system-level scores. It targets stable evaluation rather than hard-label recovery, which sets it apart from label-denoising approaches such as Dawid-Skene. On both synthetic and real human-annotated benchmarks, STABLEVAL reduces score error and ranking volatility relative to majority vote, and it is particularly robust under annotator heterogeneity and adversarial noise.
📝 Abstract
Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.
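To make the contrast between majority vote and posterior expected item credit concrete, here is a minimal Python sketch. It is not the STABLEVAL implementation. It assumes binary item correctness and treats each annotator's confusion parameters (`sens`, `spec`) as already known, whereas the abstract describes modeling them jointly with latent correctness. All function names, annotator names, and numeric values are hypothetical and chosen only for illustration.

```python
from math import prod

def posterior_credit(votes, sens, spec, prior=0.5):
    """P(item is truly correct | votes) under a two-class confusion model.

    votes:   dict annotator -> 1 (judged correct) or 0 (judged incorrect)
    sens[a]: assumed P(annotator a votes 1 | item is truly correct)
    spec[a]: assumed P(annotator a votes 0 | item is truly incorrect)
    prior:   assumed prior probability that an item is correct
    """
    # Likelihood of the observed votes if the item is truly correct.
    like_correct = prod(sens[a] if v == 1 else 1 - sens[a] for a, v in votes.items())
    # Likelihood of the observed votes if the item is truly incorrect.
    like_incorrect = prod(1 - spec[a] if v == 1 else spec[a] for a, v in votes.items())
    num = prior * like_correct
    return num / (num + (1 - prior) * like_incorrect)

def majority_credit(votes):
    """Hard 0/1 credit: 1.0 only if strictly more than half of the votes are 1."""
    return 1.0 if 2 * sum(votes.values()) > len(votes) else 0.0

def agent_score(item_credits):
    """Agent-level score as the mean of per-item credits."""
    return sum(item_credits) / len(item_credits)

# Hypothetical annotators: one reliable, two barely better than chance.
sens = {"reliable": 0.95, "noisy_1": 0.55, "noisy_2": 0.55}
spec = {"reliable": 0.95, "noisy_1": 0.55, "noisy_2": 0.55}

# The two noisy annotators outvote the reliable one on this item.
votes = {"reliable": 0, "noisy_1": 1, "noisy_2": 1}

mv = majority_credit(votes)               # hard credit from the vote count
pc = posterior_credit(votes, sens, spec)  # soft credit weighted by reliability
print(f"majority vote credit: {mv:.3f}")
print(f"posterior credit:     {pc:.3f}")
```

In this toy case majority vote awards full credit, while the posterior credit is about 0.07, because the single reliable annotator outweighs two near-random ones. Averaging soft credits with `agent_score` instead of hard labels is one way such a score can change less when the annotator subset changes, which is the kind of stability the abstract argues for.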