Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

📅 2026-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the problem of efficient data acquisition under a fixed budget when multiple data sources exhibit heterogeneous sampling costs and distributional shifts relative to the target population. The authors propose a sampling strategy that quantifies source–target divergence via χ² divergence and maximizes the effective sample size, coupled with a post-stratification estimator to achieve minimax optimal estimation risk for both the overall and subgroup conditional means. This approach is the first to establish theoretically optimal risk bounds simultaneously for population-level and conditional mean estimation under such constraints. Furthermore, the framework naturally extends to predictive settings, where it minimizes excess risk in downstream learning tasks.
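The central quantity in the summary, the effective sample size, can be sketched numerically. Below is a minimal illustration (not the authors' code) of the definition given in the abstract: the χ²-divergence between the target group distribution q and the aggregated source distribution p̄, and the resulting effective sample size n / (D_χ²(q‖p̄) + 1). All numbers are made-up example proportions.

```python
import numpy as np

def chi2_divergence(q, p):
    # D_chi2(q || p) = sum_g q_g^2 / p_g - 1 over groups g (assumes p_g > 0)
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q**2 / p) - 1.0)

def effective_sample_size(n, q, p_bar):
    # n_eff = n / (D_chi2(q || p_bar) + 1); equals n only when p_bar == q
    return n / (chi2_divergence(q, p_bar) + 1.0)

# Hypothetical target composition q vs. aggregated source composition p_bar
q = np.array([0.5, 0.3, 0.2])
p_bar = np.array([0.3, 0.4, 0.3])
print(effective_sample_size(1000, q, p_bar))  # strictly below 1000
```

Distributional mismatch between sources and target shrinks the effective sample size, which is what the paper's sampling plan is designed to counteract.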

📝 Abstract
Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations), and the relative composition of these groups may differ substantially, both among the source populations and between the sources and the target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to "match" the target distribution) or relying on standard estimators (e.g. the sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size: the total sample size divided by $D_{\chi^2}(q\mid\mid\overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems when minimizing the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.
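The post-stratification estimator mentioned in the abstract is classical, so it can be illustrated directly: reweight within-group sample means by the target group proportions, rather than averaging all samples as-is. The sketch below uses synthetic data with an assumed two-group setup (group means 1.0 and 3.0, source over-sampling group 0) purely for illustration; it is not the paper's experimental setup.

```python
import numpy as np

def post_stratified_mean(y, groups, q):
    # Post-stratification: mu_hat = sum_g q_g * mean(y | group g),
    # where q_g is the target proportion of group g.
    y, groups = np.asarray(y, dtype=float), np.asarray(groups)
    return sum(q_g * y[groups == g].mean() for g, q_g in q.items())

rng = np.random.default_rng(0)
# Source over-samples group 0 (80/20), but the target is a 50/50 mix.
groups = rng.choice([0, 1], size=2000, p=[0.8, 0.2])
y = np.where(groups == 0, 1.0, 3.0) + rng.normal(0.0, 0.1, size=2000)
q = {0: 0.5, 1: 0.5}                       # target composition

naive = y.mean()                           # biased toward group 0
ps = post_stratified_mean(y, groups, q)    # close to target mean of 2.0
```

With these assumed parameters, the naive sample mean lands near 1.4 (the source mixture's mean), while the post-stratified estimate recovers the target-population mean of 2.0, which is the bias the paper's paired sampling-plan-plus-estimator approach addresses.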
Problem

Research questions and friction points this paper is trying to address.

biased data
costly data sources
multi-source data collection
budget constraint
population mean estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

minimax optimality
costly data collection
heterogeneous sources
post-stratification
χ²-divergence