Post-ADC Inference: Valid Inference After Active Data Collection

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem that active data collection strategies, such as sequential model-based optimization (SMBO), sample adaptively and thereby invalidate conventional statistical inference. The authors propose post-ADC inference, a framework that extends selective inference to active data collection settings and corrects for two sources of bias at once: the adaptive data collection itself and the data-dependent construction of the inferential target. The framework imposes assumptions only on the observation noise, with none on the black-box objective function or the surrogate model, so it applies to a broad class of SMBO algorithms such as GP-UCB and TPE and yields valid $p$-values and confidence intervals. Experiments show that post-ADC inference remains valid on data collected by GP-UCB and TPE.

📝 Abstract

The validity of statistical inference depends critically on how data are collected. When data gathered through active data collection (ADC) are reused for a post-hoc inferential task, conventional inference can fail because the sampling is adaptively biased toward regions favored by the collection strategy. This issue is especially pronounced in black-box optimization, where sequential model-based optimization (SMBO) methods such as the tree-structured Parzen estimator (TPE) and Gaussian process upper confidence bound (GP-UCB) preferentially concentrate evaluations in promising regions. We study statistical inference on actively collected data when the inferential target is constructed in a data-dependent manner after data collection. To enable valid inference in this setting, we propose post-ADC inference, a framework that accounts for the biases arising from both the active data collection process and the subsequent data-driven target construction. Our method builds on selective inference and provides valid $p$-values and confidence intervals that correct for both sources of bias. The framework applies to a broad class of ADC processes by imposing only assumptions on the observation noise, without requiring any assumptions on the underlying black-box function or the surrogate model used by the SMBO algorithm. Empirical results also show that post-ADC inference provides valid inference for data collected by GP-UCB and TPE.
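To make the target-construction bias concrete, the following is a minimal toy sketch and not the paper's method. It assumes a fixed (non-adaptive) set of evaluation points, a true function that is zero everywhere, and independent N(0, 1) observation noise. The target is chosen after seeing the data as the point with the largest observation. The naive p-value ignores that choice, while the selective p-value conditions on it by using a normal distribution truncated at the runner-up observation. The names `normal_sf` and `simulate` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def normal_sf(x):
    """Upper-tail probability P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def simulate(num_points=10, num_trials=20000, alpha=0.05, seed=0):
    """Return (naive, selective) rejection rates under the null hypothesis."""
    rng = random.Random(seed)
    naive_rejections = 0
    selective_rejections = 0
    for _ in range(num_trials):
        # Assumed null setting: true value 0 at every point, noise N(0, 1).
        y = sorted(rng.gauss(0.0, 1.0) for _ in range(num_points))
        best, runner_up = y[-1], y[-2]

        # Naive test: treats the selected point as if it were fixed in advance.
        naive_p = normal_sf(best)

        # Selective test: conditions on "this point has the largest observation",
        # which truncates its null distribution to the interval [runner_up, inf).
        selective_p = normal_sf(best) / normal_sf(runner_up)

        naive_rejections += naive_p < alpha
        selective_rejections += selective_p < alpha
    return naive_rejections / num_trials, selective_rejections / num_trials


naive_rate, selective_rate = simulate()
print(f"naive rejection rate:     {naive_rate:.3f}")
print(f"selective rejection rate: {selective_rate:.3f}")
```

In this toy setting the naive test rejects a true null far more often than the nominal 5% level (the theoretical rate with 10 points is 1 - 0.95^10, about 0.40), whereas the selective test stays close to 5%. The sketch covers only the bias from choosing the target after seeing the data; the adaptive collection bias that post-ADC inference also corrects for is not modeled here.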

Problem

Research questions and friction points this paper is trying to address.

active data collection

post-hoc inference

adaptive bias

selective inference

statistical validity

Innovation

Methods, ideas, or system contributions that make the work stand out.

post-ADC inference

selective inference

active data collection