🤖 AI Summary
This study addresses the lack of validated methods for quantifying inconsistency in a physician's clinical decisions. The authors introduce a controllable synthetic-data benchmark and use it to evaluate the accuracy and rank fidelity of eight measurement approaches across 94 experimental conditions. The evaluated methods are Euclidean distance, Mahalanobis distance, learned-weights matching, genetic Mahalanobis distance, random forest proximity, mutual-information weighting, latent profile analysis, and a Bayesian binomial generalized linear mixed model (GLMM). Learned-weights matching achieves the lowest mean absolute error (MAE = 0.027). Under continuous physician heterogeneity, supervised feature-weighting methods and the GLMM retain a Spearman rank correlation of 0.62–0.68 with the ground truth, whereas unsupervised methods fall to 0.28–0.35.
📝 Abstract
Intra-physician prescribing variability, the probability that one physician issues discordant decisions for two patients deemed comparable on observed covariates, has substantial implications for quality of care, safety, and cost. However, there are no known validated methods for measuring it. Here, we benchmark eight methods (Euclidean, Mahalanobis, Learned-Weights, Genetic Mahalanobis, Random Forest proximity, Mutual-Information-weighted, Latent Profile Analysis, and a Bayesian binomial generalized linear mixed model) against a synthetic ground truth across 94 experimental conditions. Learned-Weights matching achieves the lowest mean absolute error (0.027), followed by Mutual-Information-weighted matching (0.028) and Random Forest proximity (0.034). All eight discordance-analysis methods preserve the physician rank ordering with high fidelity (Spearman > 0.89 versus the ground truth on the SCORE2 experiment), provided the physician variability groups are well separated. Under a continuous-heterogeneity physician model, rank preservation degrades substantially for unsupervised methods (Spearman between 0.28 and 0.35) but is largely retained by supervised feature-weighted methods and the GLMM (Spearman between 0.62 and 0.68). This controlled methodological evaluation lays a foundation for validation on observational prescribing data. Once validated there, the evaluated open-source estimators could turn prescribing inconsistency into a routinely measurable clinician-level quality metric, complementing the existing literature on between-physician variation.
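
To make the quantities in the abstract concrete, the following is a minimal sketch of the simplest matching idea (plain Euclidean distance) together with the two evaluation metrics, mean absolute error and Spearman rank correlation. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed `radius` used to decide when two patients count as comparable, and the toy data are all hypothetical.

```python
import numpy as np


def discordance_rate(X, y, radius):
    """Share of comparable patient pairs that received different decisions.

    X: (n_patients, n_covariates) covariates for ONE physician's patients.
    y: (n_patients,) binary decisions (e.g. 1 = prescribed, 0 = not).
    radius: hypothetical threshold; pairs with Euclidean distance <= radius
            are treated as "comparable". The paper may define this differently.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all patients.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Look at each unordered pair once (upper triangle, no diagonal).
    i, j = np.triu_indices(len(X), k=1)
    comparable = dist[i, j] <= radius
    if not comparable.any():
        return float("nan")  # no comparable pairs, rate is undefined
    return float((y[i] != y[j])[comparable].mean())


def mean_absolute_error(estimated, truth):
    """MAE between per-physician estimates and the synthetic ground truth."""
    estimated, truth = np.asarray(estimated), np.asarray(truth)
    return float(np.mean(np.abs(estimated - truth)))


def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks.

    Simplification: ties are not averaged, so use only on tie-free inputs.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


# Toy physician: two pairs of near-identical patients, one pair discordant.
X_toy = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
y_toy = [1, 0, 1, 1]
print(discordance_rate(X_toy, y_toy, radius=0.5))  # 0.5
```

In a benchmark of this kind, `discordance_rate` would be computed once per simulated physician and the resulting vector compared with the known ground-truth rates via `mean_absolute_error` (accuracy) and `spearman` (rank fidelity). The supervised variants named in the abstract would replace the unweighted distance with a feature-weighted one.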