🤖 AI Summary
Contemporary evaluation of binary classifiers in machine learning (e.g., for medical test ordering or pretrial detention decisions) overrelies on top-K metrics (e.g., Precision@K) or fixed-threshold measures (e.g., Accuracy), neglecting the differential real-world costs of false positives and false negatives. Method: The authors establish the first systematic mapping framework linking evaluation metrics to concrete decision-making contexts, grounded in consequentialist decision theory. They theoretically justify threshold-free proper scoring rules (e.g., Brier score, log loss) for autonomous decision settings and prove a formal equivalence between the Brier score and Decision Curve Analysis (DCA), addressing longstanding concerns about the clinical interpretability of proper scoring rules. Contribution/Results: Empirical analysis reveals that >78% of top-tier conference papers employ such biased metrics. The authors release briertools, an open-source Python package, and propose an interpretable, actionable AI evaluation guideline grounded in decision-theoretic principles.
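For context on why false-positive and false-negative costs matter: a standard result in consequentialist decision theory is that the utility-maximizing rule acts on a forecast probability p whenever p exceeds t* = c_FP / (c_FP + c_FN), so every cost ratio corresponds to a decision threshold. A minimal sketch of this relationship (the function name is illustrative, not part of briertools):

```python
def optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Decision-theoretic threshold: predict positive whenever the
    forecast probability p satisfies p > c_FP / (c_FP + c_FN)."""
    return cost_fp / (cost_fp + cost_fn)

# If a missed case (false negative) is 9x as costly as a false alarm:
print(optimal_threshold(1, 9))  # 0.1

# With symmetric costs, the familiar 0.5 cutoff falls out as a special case:
print(optimal_threshold(1, 1))  # 0.5
```

Fixed-threshold metrics like Accuracy implicitly commit to one such cost ratio, which is why a score that averages over a range of thresholds can better reflect heterogeneous deployment costs.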
📝 Abstract
ML-supported decisions, such as ordering tests or determining preventive custody, often involve binary classification based on probabilistic forecasts. Evaluation frameworks for such forecasts typically consider whether to prioritize independent-decision metrics (e.g., Accuracy) or top-K metrics (e.g., Precision@K), and whether to focus on fixed thresholds or threshold-agnostic measures like AUC-ROC. We highlight that a consequentialist perspective, long advocated by decision theorists, should naturally favor evaluations that support independent decisions using a mixture of thresholds given their prevalence, such as Brier scores and Log loss. However, our empirical analysis reveals a strong preference for top-K metrics or fixed thresholds in evaluations at major conferences like ICML, FAccT, and CHIL. To address this gap, we use this decision-theoretic framework to map evaluation metrics to their optimal use cases, along with a Python package, briertools, to promote the broader adoption of Brier scores. In doing so, we also uncover new theoretical connections, including a reconciliation between the Brier Score and Decision Curve Analysis, which clarifies and responds to a longstanding critique by Assel et al. (2017) regarding the clinical utility of proper scoring rules.
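The Brier score and log loss named above are both proper scoring rules on probabilistic forecasts: the Brier score is the mean squared error between forecast probabilities and binary outcomes, and log loss is the negative mean log-likelihood. A self-contained sketch of both (this is not the briertools API, whose interface is not shown here):

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def log_loss(y_true, y_prob, eps=1e-15):
    """Negative mean log-likelihood; probabilities are clipped away from
    0 and 1 to keep the logarithm finite."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# A perfectly calibrated, perfectly confident forecaster scores 0:
y = [1, 0, 1, 1]
print(brier_score(y, [1.0, 0.0, 1.0, 1.0]))  # 0.0
```

Unlike a metric computed at one fixed threshold, both scores evaluate the forecast probability itself, which is what makes them suitable when downstream decision-makers apply their own cost-dependent cutoffs.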