Asymptotic Theory of Iterated Empirical Risk Minimization, with Applications to Active Learning

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the statistical dependence induced by data reuse in two-stage empirical risk minimization (ERM), a common issue in settings such as active learning. The authors propose an iterative ERM framework that avoids data splitting or oracle assumptions. Leveraging tools from high-dimensional statistics, convex optimization, and random matrix theory, they derive—for the first time—the precise asymptotic expression for test error under linear models and convex losses. Their analysis reveals a double-descent phenomenon in test error driven by data selection and characterizes the fundamental trade-off between labeling budget allocation and generalization performance. Applied to pool-based active learning, the theory accurately predicts the performance of second-stage estimators, highlighting the critical role of annotation strategies in shaping generalization error.

📝 Abstract
We study a class of iterated empirical risk minimization (ERM) procedures in which two successive ERMs are performed on the same dataset, and the predictions of the first estimator enter as an argument in the loss function of the second. This setting, which arises naturally in active learning and reweighting schemes, introduces intricate statistical dependencies across samples and fundamentally distinguishes the problem from classical single-stage ERM analyses. For linear models trained with a broad class of convex losses on Gaussian mixture data, we derive a sharp asymptotic characterization of the test error in the high-dimensional regime where the sample size and ambient dimension scale proportionally. Our results provide explicit, fully asymptotic predictions for the performance of the second-stage estimator despite the reuse of data and the presence of prediction-dependent losses. We apply this theory to revisit a well-studied pool-based active learning problem, removing oracle and sample-splitting assumptions made in prior work. We uncover a fundamental tradeoff in how the labeling budget should be allocated across stages, and demonstrate a double-descent behavior of the test error driven purely by data selection, rather than model size or sample count.
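The two-stage procedure described above can be illustrated with a minimal numerical sketch. This is a hedged toy example, not the paper's method: ridge regression stands in for the generic convex loss, the Gaussian mixture setup (`mu`, the noise model) and the margin-based selection rule are illustrative assumptions, and the second stage deliberately reuses the same pool that produced the first-stage predictions, so no sample splitting occurs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gaussian mixture: labels y in {-1, +1}, features x = y * mu + noise.
# (Hypothetical parameters chosen only for illustration.)
n, d = 400, 100
mu = np.ones(d) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.standard_normal((n, d))

def ridge(X, y, lam=1.0):
    """Closed-form ridge ERM (square loss + l2 penalty), a stand-in
    for the paper's general convex loss."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Stage 1: ERM on the full pool.
w1 = ridge(X, y)
scores = X @ w1

# Stage 2: re-fit on the SAME data, restricted to the low-margin half
# selected by the first-stage predictions. Because the selection mask
# depends on w1, which was fit on this very pool, the second-stage
# samples are statistically dependent -- the effect the theory tracks.
mask = np.abs(scores) < np.quantile(np.abs(scores), 0.5)
w2 = ridge(X[mask], y[mask])

# Test error of each stage on fresh data from the same mixture.
yt = rng.choice([-1.0, 1.0], size=2000)
Xt = yt[:, None] * mu + rng.standard_normal((2000, d))
err1 = float(np.mean(np.sign(Xt @ w1) != yt))
err2 = float(np.mean(np.sign(Xt @ w2) != yt))
print(f"stage-1 test error: {err1:.3f}, stage-2 test error: {err2:.3f}")
```

In the paper's high-dimensional regime (n and d growing proportionally), quantities like `err2` are what the sharp asymptotic characterization predicts; the sketch only makes the data-reuse structure concrete.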
Problem

Research questions and friction points this paper is trying to address.

iterated empirical risk minimization
active learning
statistical dependencies
prediction-dependent losses
high-dimensional asymptotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterated ERM
Asymptotic theory
Active learning
Double descent
High-dimensional statistics