🤖 AI Summary
For high-stakes decision-making domains that lack ground-truth labels—such as judicial adjudication and clinical diagnosis—this paper proposes a human-centered framework for evaluating classifiers. Methodologically, it introduces the *Rater Equivalence Number* (REN), a formal metric quantifying how many human raters' combined judgment a model's performance is equivalent to, thereby enabling interpretable, human-aligned assessment. The framework distinguishes two utility models: *ground-truth consistency* (agreement with a latent, inaccessible ground truth) and *individual judgment matching* (fidelity to the judgments of individual human raters), supporting value-sensitive deployment trade-offs. It combines crowdsourced annotation, benchmark panel construction, and statistical analysis so that human-labeled data serve both as the evaluation reference and as the basis for quantifying model performance. Case studies and formal analysis support its soundness and practical utility, yielding an actionable evaluation paradigm and deployment guidance for AI systems operating without gold-standard labels.
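One plausible formalization of the REN, assuming panels are scored under the same utility model as the classifier (the notation below is illustrative and not taken from the paper):

```latex
% Illustrative only: U(f) is the utility of classifier f under the chosen
% utility model (ground-truth consistency or individual judgment matching),
% and U(P_k) is the utility of a randomly drawn panel of k human raters.
\mathrm{REN}(f) \;=\; \min \left\{\, k \in \mathbb{N} \;:\; \mathbb{E}\!\left[ U(P_k) \right] \,\ge\, U(f) \,\right\}
```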
📝 Abstract
In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments: we quantify a classifier's performance by its rater equivalence, the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
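As a concrete illustration of the rater-equivalence idea, the sketch below estimates the smallest panel size whose majority vote matches a classifier's agreement with a reference labeling. It is a minimal sketch under assumed conventions (binary labels, majority-vote panels, random panel sampling, ties broken toward the positive label); the function names, tie-breaking rule, and synthetic data are ours, not the paper's.

```python
import numpy as np

def panel_utility(labels, panel_size, target, rng, n_draws=2000):
    """Average agreement of a random panel's majority vote with `target`.

    labels: (n_items, n_raters) array of individual human labels (binary).
    target: (n_items,) reference labels -- a proxy for latent ground truth
            (ground-truth consistency) or one held-out rater's labels
            (individual judgment matching).
    """
    n_items, n_raters = labels.shape
    scores = []
    for _ in range(n_draws):
        panel = rng.choice(n_raters, size=panel_size, replace=False)
        votes = labels[:, panel].mean(axis=1) >= 0.5  # majority vote, ties -> 1
        scores.append(np.mean(votes == target))
    return float(np.mean(scores))

def rater_equivalence(classifier_preds, labels, target, max_panel=15, seed=0):
    """Smallest panel size whose utility matches or exceeds the classifier's."""
    rng = np.random.default_rng(seed)
    clf_utility = float(np.mean(classifier_preds == target))
    for k in range(1, max_panel + 1):
        if panel_utility(labels, k, target, rng) >= clf_utility:
            return k, clf_utility
    return None, clf_utility  # classifier outperforms all panel sizes considered

if __name__ == "__main__":
    # Synthetic example: 10 raters at ~80% accuracy, classifier at ~90%.
    rng = np.random.default_rng(1)
    truth = rng.integers(0, 2, size=200)
    humans = (rng.random((200, 10)) < np.where(truth[:, None] == 1, 0.8, 0.2)).astype(int)
    clf = (rng.random(200) < np.where(truth == 1, 0.9, 0.1)).astype(int)
    print(rater_equivalence(clf, humans, truth))
```

In this toy setup the classifier's rater equivalence typically lands around three to five raters, since a small majority-vote panel of 80%-accurate raters is needed to reach a 90%-accurate classifier's agreement level.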