🤖 AI Summary
For high-stakes decision-making domains that lack ground-truth labels—such as judicial adjudication and clinical diagnosis—this paper proposes a human-centered framework for evaluating classifiers. Methodologically, it introduces the *Rater Equivalence Number* (REN), a formal metric quantifying how many human raters' combined judgment a model's performance is equivalent to, thereby enabling interpretable, human-aligned assessment. The framework distinguishes two utility models: *ground-truth consistency* (agreement with a latent, inaccessible ground truth) and *individual judgment matching* (fidelity to the judgments of individual human raters), supporting value-sensitive deployment trade-offs. It combines crowdsourced annotation, benchmark panel construction, and statistical analysis so that human-labeled data serve both as the evaluation reference and as the basis for quantifying model performance. Case studies and formal analysis support its soundness and practical utility, yielding an actionable evaluation paradigm and deployment guidance for AI systems operating without gold-standard labels.
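One plausible formalization of the REN, assuming panels are scored under the same utility model as the classifier (the notation below is illustrative and not taken from the paper):

```latex
% Illustrative only: U(f) is the utility of classifier f under the chosen
% utility model (ground-truth consistency or individual judgment matching),
% and U(P_k) is the utility of a randomly drawn panel of k human raters.
\mathrm{REN}(f) \;=\; \min \left\{\, k \in \mathbb{N} \;:\; \mathbb{E}\!\left[ U(P_k) \right] \,\ge\, U(f) \,\right\}
```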
📝 Abstract
In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments: we quantify a classifier's performance by its rater equivalence, the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
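As a concrete illustration of the rater-equivalence idea, the sketch below estimates the smallest panel size whose majority vote matches a classifier's agreement with a reference labeling. It is a minimal sketch under assumed conventions (binary labels, majority-vote panels, random panel sampling, ties broken toward the positive label); the function names, tie-breaking rule, and synthetic data are ours, not the paper's.

```python
import numpy as np

def panel_utility(labels, panel_size, target, rng, n_draws=2000):
    """Average agreement of a random panel's majority vote with `target`.

    labels: (n_items, n_raters) array of individual human labels (binary).
    target: (n_items,) reference labels -- a proxy for latent ground truth
            (ground-truth consistency) or one held-out rater's labels
            (individual judgment matching).
    """
    n_items, n_raters = labels.shape
    scores = []
    for _ in range(n_draws):
        panel = rng.choice(n_raters, size=panel_size, replace=False)
        votes = labels[:, panel].mean(axis=1) >= 0.5  # majority vote, ties -> 1
        scores.append(np.mean(votes == target))
    return float(np.mean(scores))

def rater_equivalence(classifier_preds, labels, target, max_panel=15, seed=0):
    """Smallest panel size whose utility matches or exceeds the classifier's."""
    rng = np.random.default_rng(seed)
    clf_utility = float(np.mean(classifier_preds == target))
    for k in range(1, max_panel + 1):
        if panel_utility(labels, k, target, rng) >= clf_utility:
            return k, clf_utility
    return None, clf_utility  # classifier outperforms all panel sizes considered

if __name__ == "__main__":
    # Synthetic example: 10 raters at ~80% accuracy, classifier at ~90%.
    rng = np.random.default_rng(1)
    truth = rng.integers(0, 2, size=200)
    humans = (rng.random((200, 10)) < np.where(truth[:, None] == 1, 0.8, 0.2)).astype(int)
    clf = (rng.random(200) < np.where(truth == 1, 0.9, 0.1)).astype(int)
    print(rater_equivalence(clf, humans, truth))
```

In this toy setup the classifier's rater equivalence typically lands around three to five raters, since a small majority-vote panel of 80%-accurate raters is needed to reach a 90%-accurate classifier's agreement level.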