Post-Selection Distributional Model Evaluation

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the post-selection bias that arises when model pre-selection and performance metric estimation are conducted on the same dataset, a bias that distorts any characterization of the performance–reliability trade-off. To resolve this, the paper proposes a general framework enabling statistically valid inference of key performance indicator (KPI) distributions after any data-dependent model pre-selection. The core innovation is the integration of e-values into post-selection evaluation, achieving rigorous control of the false coverage rate (FCR) for KPI distribution estimates while being markedly more sample-efficient than sample splitting. Empirical validation on synthetic data, large language model text-to-SQL decoding, and telecommunications network performance assessment demonstrates the method's effectiveness, enabling reliable comparisons among multiple candidate configurations under varying reliability guarantees.
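
To make the mechanism concrete, here is a minimal sketch of how e-values support valid inference after data-dependent selection. It is not the paper's actual PS-DME construction: it uses a standard Hoeffding-style e-value for a bounded mean KPI and an e-BY-style level adjustment of alpha·|S|/m on the selected set, both of which are standard ingredients assumed here purely for illustration (the paper targets full KPI distributions, not just means).

```python
import numpy as np

def evalue_lower_bound(x, alpha):
    """One-sided lower confidence bound on the mean of KPI samples in [0, 1],
    obtained by inverting the Hoeffding-style e-value
        E(mu0) = exp(lam * sum(x - mu0) - n * lam**2 / 8),
    which satisfies E[E(mu0)] <= 1 whenever the true mean is <= mu0.
    Inverting {mu0 : E(mu0) < 1/alpha} at the optimal lam gives the
    closed-form bound below; coverage >= 1 - alpha follows from
    Markov's inequality applied to the e-value."""
    n = len(x)
    return float(np.mean(x) - np.sqrt(np.log(1.0 / alpha) / (2.0 * n)))

# Toy setting: m candidate models, each with a bounded KPI evaluated on the
# SAME data later used to pre-select the apparently best performers.
rng = np.random.default_rng(0)
m, n, alpha = 20, 400, 0.1
true_means = rng.uniform(0.4, 0.8, size=m)
kpi = rng.binomial(1, true_means[:, None], size=(m, n)).astype(float)

# Arbitrary data-dependent pre-selection: keep the top-3 empirical performers.
selected = np.argsort(kpi.mean(axis=1))[-3:]

# e-BY-style adjustment: run each selected e-value-based interval at level
# alpha * |S| / m, which is known to control the false coverage rate for
# e-CIs under arbitrary selection rules (illustrative stand-in for PS-DME).
adj_alpha = alpha * len(selected) / m
for j in selected:
    lb = evalue_lower_bound(kpi[j], adj_alpha)
    print(f"model {j:2d}: empirical KPI {kpi[j].mean():.3f}, "
          f"post-selection lower bound {lb:.3f}")
```

Because the e-value bound is valid under Markov's inequality regardless of how the selected set was chosen, the same data can be reused for selection and evaluation, which is the efficiency lever the summary alludes to.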

📝 Abstract
Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, requiring reliable estimation of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls the post-selection false coverage rate (FCR) for the distributional KPI estimates and is proven to be more sample-efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting statistically reliable exploration of performance–reliability trade-offs.
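
The sample-efficiency claim against sample splitting admits a back-of-the-envelope illustration. The sketch below uses the classical Dvoretzky–Kiefer–Wolfowitz (DKW) confidence band as a generic stand-in for a distributional KPI estimate; the paper's actual guarantee is e-value based, so this only shows the splitting penalty any method that discards half its data for selection must pay.

```python
import numpy as np

def dkw_halfwidth(n, alpha):
    """Half-width of the DKW confidence band for an empirical CDF:
    sup_t |F_n(t) - F(t)| <= sqrt(log(2/alpha) / (2n))
    with probability >= 1 - alpha."""
    return np.sqrt(np.log(2.0 / alpha) / (2.0 * n))

n, alpha = 1000, 0.1
# Sample splitting: half the data is spent on pre-selection, so the KPI
# distribution estimate for the surviving models rests on only n/2 points.
split_width = dkw_halfwidth(n // 2, alpha)
# A post-selection-valid method that reuses all n points (as PS-DME aims to
# do via e-values) pays no such splitting penalty.
full_width = dkw_halfwidth(n, alpha)
print(f"band half-width, sample splitting: {split_width:.4f}")
print(f"band half-width, full data:        {full_width:.4f}")
# The splitting band is wider by a factor of sqrt(2) ~ 1.41 at every point
# of the estimated KPI distribution.
print(f"ratio: {split_width / full_width:.3f}")
```

A uniform sqrt(2) widening of the band translates directly into looser conclusions at every reliability level, which is why avoiding the split while retaining FCR control matters for exploring the whole performance–reliability curve.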
Problem

Research questions and friction points this paper is trying to address.

post-selection bias
distributional model evaluation
key performance indicator (KPI)
model selection
performance–reliability trade-offs
Innovation

Methods, ideas, or system contributions that make the work stand out.

post-selection inference
distributional evaluation
e-values
false coverage rate
model selection