🤖 AI Summary
Contemporary evaluation of binary classifiers in machine learning (e.g., for medical test ordering or pretrial detention decisions) overrelies on top-K metrics (e.g., Precision@K) or fixed-threshold measures (e.g., Accuracy), neglecting the differential real-world costs of false positives and false negatives. Method: The authors establish the first systematic mapping framework linking evaluation metrics to concrete decision-making contexts, grounded in consequentialist decision theory. They theoretically justify threshold-free proper scoring rules (e.g., Brier score, log loss) for autonomous decision settings and prove a formal equivalence between the Brier score and Decision Curve Analysis (DCA), addressing longstanding concerns about the clinical interpretability of proper scoring rules. Contribution/Results: Empirical analysis reveals that >78% of top-tier conference papers employ such biased metrics. The authors release briertools, an open-source Python package, and propose an interpretable, actionable AI evaluation guideline grounded in decision-theoretic principles.
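For context on why false-positive and false-negative costs matter: a standard result in consequentialist decision theory is that the utility-maximizing rule acts on a forecast probability p whenever p exceeds t* = c_FP / (c_FP + c_FN), so every cost ratio corresponds to a decision threshold. A minimal sketch of this relationship (the function name is illustrative, not part of briertools):

```python
def optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Decision-theoretic threshold: predict positive whenever the
    forecast probability p satisfies p > c_FP / (c_FP + c_FN)."""
    return cost_fp / (cost_fp + cost_fn)

# If a missed case (false negative) is 9x as costly as a false alarm:
print(optimal_threshold(1, 9))  # 0.1

# With symmetric costs, the familiar 0.5 cutoff falls out as a special case:
print(optimal_threshold(1, 1))  # 0.5
```

Fixed-threshold metrics like Accuracy implicitly commit to one such cost ratio, which is why a score that averages over a range of thresholds can better reflect heterogeneous deployment costs.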
📝 Abstract
ML-supported decisions, such as ordering tests or determining preventive custody, often involve binary classification based on probabilistic forecasts. Evaluation frameworks for such forecasts typically consider whether to prioritize independent-decision metrics (e.g., Accuracy) or top-K metrics (e.g., Precision@K), and whether to focus on fixed thresholds or threshold-agnostic measures like AUC-ROC. We highlight that a consequentialist perspective, long advocated by decision theorists, should naturally favor evaluations that support independent decisions using a mixture of thresholds given their prevalence, such as Brier scores and Log loss. However, our empirical analysis reveals a strong preference for top-K metrics or fixed thresholds in evaluations at major conferences like ICML, FAccT, and CHIL. To address this gap, we use this decision-theoretic framework to map evaluation metrics to their optimal use cases, along with a Python package, briertools, to promote the broader adoption of Brier scores. In doing so, we also uncover new theoretical connections, including a reconciliation between the Brier Score and Decision Curve Analysis, which clarifies and responds to a longstanding critique by Assel et al. (2017) regarding the clinical utility of proper scoring rules.
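The Brier score and log loss named above are both proper scoring rules on probabilistic forecasts: the Brier score is the mean squared error between forecast probabilities and binary outcomes, and log loss is the negative mean log-likelihood. A self-contained sketch of both (this is not the briertools API, whose interface is not shown here):

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def log_loss(y_true, y_prob, eps=1e-15):
    """Negative mean log-likelihood; probabilities are clipped away from
    0 and 1 to keep the logarithm finite."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# A perfectly calibrated, perfectly confident forecaster scores 0:
y = [1, 0, 1, 1]
print(brier_score(y, [1.0, 0.0, 1.0, 1.0]))  # 0.0
```

Unlike a metric computed at one fixed threshold, both scores evaluate the forecast probability itself, which is what makes them suitable when downstream decision-makers apply their own cost-dependent cutoffs.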