🤖 AI Summary
In settings where label acquisition is costly, classical two-sample tests, which require fully labeled samples, are impractical.
Method: This work is the first to bring active learning into nonparametric two-sample testing, proposing a unified theoretical framework that jointly optimizes label efficiency and statistical validity. The approach integrates U-statistic construction, bias correction, and an adaptive query strategy to maximize test power under a strict label budget.
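To make these ingredients concrete, here is a minimal Python sketch of an MMD-style U-statistic with a budgeted query step. The Gaussian kernel, the `query_split` helper, and the uniform querying rule are illustrative assumptions, not the paper's algorithm; the paper's adaptive strategy would choose which point to label next based on current estimates rather than at random.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Gaussian RBF kernel between two feature vectors.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_u_statistic(X, Y, bandwidth=1.0):
    # Unbiased U-statistic estimate of the squared MMD between X and Y.
    # Excluding the i == j diagonal terms is what removes the bias of the
    # naive plug-in (V-statistic) estimator.
    n, m = len(X), len(Y)
    kxx = sum(gaussian_kernel(X[i], X[j], bandwidth)
              for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    kyy = sum(gaussian_kernel(Y[i], Y[j], bandwidth)
              for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    kxy = sum(gaussian_kernel(X[i], Y[j], bandwidth)
              for i in range(n) for j in range(m)) / (n * m)
    return kxx + kyy - 2.0 * kxy

def query_split(pool, query_label, budget, rng):
    # Spend the label budget on points from the unlabeled pool and return
    # the two labeled subsamples.  Uniform random querying is a placeholder:
    # an adaptive strategy would instead pick each next point where a label
    # is expected to be most informative for the test statistic.
    idx = rng.choice(len(pool), size=budget, replace=False)
    labels = np.array([query_label(i) for i in idx])
    return pool[idx[labels == 0]], pool[idx[labels == 1]]
```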
Contribution/Results: The framework comes with finite-sample control of the Type-I error and asymptotic guarantees on test power. Empirical evaluation on real-world tasks, including controlled medical studies, shows substantial gains over conventional passive two-sample tests while remaining interpretable and practical to deploy.
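One standard way to obtain finite-sample Type-I error control is permutation calibration: under the null the pooled observations are exchangeable, so reassigning labels at random yields an exact null distribution at any sample size. The sketch below (reusing `mmd_u_statistic` from above) illustrates that idea; it is not necessarily the paper's exact calibration, and validity under adaptive querying would require the paper's additional corrections.

```python
def permutation_p_value(X, Y, n_permutations=200, bandwidth=1.0, rng=None):
    # Calibrate the U-statistic by permutation: under the null hypothesis
    # the pooled sample is exchangeable, so randomly reassigning labels
    # simulates draws from the null distribution of the statistic.
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate([X, Y])
    n = len(X)
    observed = mmd_u_statistic(X, Y, bandwidth)
    count = 1  # count the observed statistic itself, making the test exact
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        stat = mmd_u_statistic(pooled[perm[:n]], pooled[perm[n:]], bandwidth)
        if stat >= observed:
            count += 1
    return count / (n_permutations + 1)
```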
📝 Abstract
Hypothesis testing is a statistical inference approach used to determine whether data support a specific hypothesis. An important type is the two-sample test, which evaluates whether two sets of data points are drawn from identical distributions. This test is widely used, for example by clinical researchers comparing treatment effectiveness. This tutorial explores two-sample testing in a context where an analyst has many features from two samples, but determining the sample membership (or labels) of these features is costly. In machine learning, a similar scenario is studied in active learning. This tutorial extends active learning concepts to two-sample testing within this label-costly setting while maintaining statistical validity and high testing power. Additionally, the tutorial discusses practical applications of these label-efficient two-sample tests.
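As a toy illustration of this label-costly setting, the snippet below draws a pooled set of feature vectors from two Gaussians, hides sample membership behind a stand-in oracle, spends a budget of 60 label queries, and reports the statistic and permutation p-value using the sketches above. The data, budget, and helper names are all illustrative.

```python
# Two populations that differ by a mean shift; labels hidden behind an oracle.
rng = np.random.default_rng(0)
pool = np.concatenate([rng.normal(0.0, 1.0, size=(100, 2)),   # population 0
                       rng.normal(0.5, 1.0, size=(100, 2))])  # population 1
true_labels = np.array([0] * 100 + [1] * 100)

# Spend a budget of 60 costly label queries, then run the test.
X, Y = query_split(pool, lambda i: true_labels[i], budget=60, rng=rng)
print("MMD^2 U-statistic:", mmd_u_statistic(X, Y))
print("permutation p-value:", permutation_p_value(X, Y, rng=rng))
```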