🤖 AI Summary
This study addresses the problem of estimating U-statistics efficiently when labels are costly and the labeling budget is limited, while still ensuring valid statistical inference. The work introduces active learning to U-statistic estimation for the first time, proposing an active inference framework that combines informative label querying with machine learning predictions. It designs an augmented inverse probability weighted U-statistic and derives the optimal sampling strategy that minimizes the estimator's variance. On real datasets, the proposed method substantially improves estimation efficiency: it reaches confidence interval coverage comparable to that of baseline approaches while using significantly fewer labeled samples. The framework also extends to empirical risk minimization based on U-statistics.
📝 Abstract
$U$-statistics play a central role in statistical inference. In many modern applications, however, acquiring the labels required for $U$-statistics is costly. Motivated by recent advances in active inference, we develop an active inference framework for $U$-statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting $U$-statistic, which is designed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to $U$-statistic-based empirical risk minimization. Experiments on real datasets demonstrate substantial gains in estimation efficiency over baseline methods, while maintaining target coverage.
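To make the idea of an augmented inverse probability weighted (AIPW) $U$-statistic concrete, here is a minimal sketch for a degree-2 kernel. It is an illustration under stated assumptions, not necessarily the estimator defined in the paper. It assumes labels are queried independently with known probabilities `pi`, and it combines the kernel evaluated on predictions `f` with an inverse-probability-weighted correction on labeled pairs. The function names, the variance kernel, and the heuristic used to set `pi` are all hypothetical choices made for the example.

```python
import numpy as np


def variance_kernel(a, b):
    """Degree-2 kernel whose U-statistic is the unbiased sample variance."""
    return 0.5 * (a - b) ** 2


def aipw_u_statistic(y, f, xi, pi, kernel):
    """Illustrative AIPW-style U-statistic of degree 2 (assumed form).

    y  : labels (entries with xi == 0 are never used and may be NaN)
    f  : machine learning predictions of the labels
    xi : 0/1 indicators, xi[i] == 1 if label i was queried
    pi : known query probabilities, with xi[i] ~ Bernoulli(pi[i]) independently
    """
    n = len(f)
    i, j = np.triu_indices(n, k=1)            # all pairs with i < j
    h_pred = kernel(f[i], f[j])               # kernel on predictions, every pair
    y_safe = np.where(xi == 1, y, 0.0)        # mask unqueried labels
    h_true = kernel(y_safe[i], y_safe[j])     # only used where both labels exist
    w = (xi[i] * xi[j]) / (pi[i] * pi[j])     # pairwise inverse probability weight
    return np.mean(h_pred + w * (h_true - h_pred))


# Synthetic demonstration (all data here is simulated for illustration).
rng = np.random.default_rng(0)
n = 500
y = rng.normal(size=n)
f = y + rng.normal(scale=0.3, size=n)         # imperfect predictions

# Hypothetical heuristic, NOT the paper's optimal rule: query more often
# where the prediction is far from the average prediction.
score = np.abs(f - f.mean()) + 0.1
pi = np.clip(0.3 * score / score.mean(), 0.05, 1.0)
xi = (rng.random(n) < pi).astype(float)

estimate = aipw_u_statistic(y, f, xi, pi, variance_kernel)
full_label_value = np.var(y, ddof=1)          # what full labeling would give
print(estimate, full_label_value)
```

Because each pair's weight has expectation one under independent querying, the correction term removes the bias of the prediction-only estimate in this sketch, and when the predictions are accurate the correction is small. That is the intuition behind choosing the sampling rule to reduce variance.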