Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the reproducibility crisis in AI evaluation, which it attributes to subjective bias in human annotation and to a shortage of rating data, both of which make evaluation results unstable. The authors propose a multi-level bootstrapping framework that uses large-scale human rating data with persistent annotator identifiers to model annotator behavior and to quantify the joint effect of the number of items ($N$) and the number of ratings per item ($K$) on evaluation reproducibility. Through multi-level bootstrap resampling, variance modeling, and significance analysis, the work characterizes the trade-offs among $N$–$K$ configurations and what they imply for designing reliable assessments. The approach offers a methodological foundation for more robust, data-driven AI evaluation.

📝 Abstract

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionally challenging because very little data exists to study how experimental repeatability actually improves as the annotator pool grows. Standard evaluation practices typically rely on a small number of annotations per item (often 3 to 5) and lack the persistent rater identifiers necessary to model individual variance across items. In this work, we introduce a multi-level bootstrapping approach to realistically model annotator behavior. Leveraging datasets with a large number of ratings and persistent rater identifiers, we analyze the tradeoffs between the number of items ($N$) and the number of responses per item ($K$) required to achieve statistical significance.
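To make the idea of resampling at more than one level concrete, here is a minimal sketch of a two-level bootstrap over items and ratings. It is not the authors' procedure, which the abstract does not specify: the function names (`make_toy_ratings`, `two_level_bootstrap`, `percentile_interval`), the 1-to-5 rating scale, the synthetic rater-bias model, and the percentile interval are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def make_toy_ratings(n_items=50, n_raters=20, seed=0):
    """Build synthetic ratings (an assumption, not the paper's data).

    Returns a dict: item_id -> list of (rater_id, score) with scores in 1..5.
    Each rater has a persistent bias that carries across every item they rate.
    """
    rng = random.Random(seed)
    rater_bias = {r: rng.gauss(0.0, 0.5) for r in range(n_raters)}
    ratings = {}
    for item in range(n_items):
        quality = rng.uniform(2.0, 4.0)  # latent item quality
        rows = []
        for rater, bias in rater_bias.items():
            raw = quality + bias + rng.gauss(0.0, 0.7)
            rows.append((rater, min(5, max(1, round(raw)))))
        ratings[item] = rows
    return ratings


def two_level_bootstrap(ratings, n_items, k_ratings, n_boot=1000, seed=0):
    """Resample items, then ratings within each sampled item.

    Level 1: draw n_items item ids with replacement.
    Level 2: for each drawn item, draw k_ratings ratings with replacement.
    Returns one mean score per bootstrap replicate.
    """
    rng = random.Random(seed)
    item_ids = list(ratings)
    replicate_means = []
    for _ in range(n_boot):
        sampled_items = rng.choices(item_ids, k=n_items)
        item_means = []
        for item in sampled_items:
            drawn = rng.choices(ratings[item], k=k_ratings)
            item_means.append(mean(score for _, score in drawn))
        replicate_means.append(mean(item_means))
    return replicate_means


def percentile_interval(stats, alpha=0.05):
    """Simple percentile interval over the bootstrap replicates."""
    ordered = sorted(stats)
    lower = ordered[int((alpha / 2) * len(ordered))]
    upper = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lower, upper


if __name__ == "__main__":
    toy = make_toy_ratings()
    # Compare how the interval changes across a few (N, K) configurations.
    for n, k in [(10, 3), (50, 3), (10, 15)]:
        interval = percentile_interval(two_level_bootstrap(toy, n, k))
        print(f"N={n:>2} K={k:>2} interval={interval}")
```

Running the script prints an interval for each ($N$, $K$) pair, which is one rough way to see how spreading a rating budget across items versus ratings per item affects stability. Because the rater identifiers are kept in each tuple, the sketch could be extended with a further level that resamples raters themselves, but that extension is likewise an assumption rather than something the abstract describes.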

Problem

Research questions and friction points this paper is trying to address.

reproducibility

evaluation

annotator bias

human annotation

statistical significance

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-level bootstrapping

annotator modeling

reproducibility