🤖 AI Summary
This paper develops an exact asymptotic risk analysis of high-dimensional logistic regression under data dependence, specifically block-wise correlation, *m*-dependence, and hybrid structures, moving beyond the conventional independence assumption. Methodologically, it pioneers the extension of the Convex Gaussian Min-Max Theorem (CGMT) to settings where both covariates and responses exhibit dependence, integrating dependent random matrix theory with mixed-process analysis to establish a novel CGMT-based analytical framework. The theoretical contributions are threefold: (1) a rigorous proof of Gaussian universality for high-dimensional logistic regression under dependent data; (2) the first asymptotic quantification of how data augmentation modifies generalization error, characterizing its precise risk-correction effect; and (3) a provably sound high-dimensional statistical foundation for data augmentation strategies in deep learning.
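For context, the classical independent-data CGMT (the statement the paper generalizes, not the paper's new dependent-data version) compares a primary Gaussian min-max problem to a far simpler auxiliary one:

$$
\Phi(G) = \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u} \; u^\top G w + \psi(w, u),
\qquad
\phi(g, h) = \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u} \; \|w\|_2\, g^\top u + \|u\|_2\, h^\top w + \psi(w, u),
$$

where $G \in \mathbb{R}^{n \times p}$ has i.i.d. standard Gaussian entries, $g \in \mathbb{R}^n$ and $h \in \mathbb{R}^p$ are independent standard Gaussian vectors, $\mathcal{S}_w, \mathcal{S}_u$ are compact sets, and $\psi$ is convex in $w$ and concave in $u$. The theorem asserts $\mathbb{P}(\Phi(G) < c) \le 2\,\mathbb{P}(\phi(g, h) \le c)$ for every $c$, with a matching upper-tail bound under convexity, so high-probability properties of the estimator can be read off the auxiliary problem. The paper's extension removes the requirement that the rows and entries of $G$ be independent.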
📝 Abstract
Over the last decade, a wave of research has characterized the exact asymptotic risk of many high-dimensional models in the proportional regime. Two foundational results have driven this progress: Gaussian universality, which shows that the asymptotic risk of estimators trained on non-Gaussian and Gaussian data is equivalent, and the convex Gaussian min-max theorem (CGMT), which characterizes the risk under Gaussian settings. However, these results rely on the assumption that the data consists of independent random vectors, an assumption that significantly limits their applicability to many practical setups. In this paper, we address this limitation by generalizing both results to the dependent setting. More precisely, we prove that Gaussian universality still holds for high-dimensional logistic regression under block dependence, $m$-dependence, and special cases of mixing, and we establish a novel CGMT framework that accommodates correlation across both the covariates and the observations. Using these results, we characterize the impact of data augmentation, a widespread practice in deep learning, on the asymptotic risk.
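To make the universality claim concrete, here is a minimal simulation sketch: a logistic fit on block-dependent data with Rademacher (non-Gaussian) entries should attain approximately the same test error as one trained on Gaussian data with the same dependence structure. All choices below (sample sizes, block size, correlation level `rho`, the ridge-regularized `sklearn` fit) are illustrative assumptions, not the paper's protocol or constants.

```python
# Numerical sketch of Gaussian universality under block-dependent observations.
# The two designs share the same second-order dependence structure; only the
# marginal law of the entries differs (Gaussian vs. Rademacher).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 800, 400            # proportional regime: p/n = 0.5
block, rho = 4, 0.5        # observations correlated within blocks of size 4

beta = rng.normal(size=p)
beta *= 2.0 / np.linalg.norm(beta)   # fix the signal strength

def block_corr(n, block, rho):
    """Block-diagonal correlation across observations (block must divide n)."""
    R = np.full((block, block), rho) + (1.0 - rho) * np.eye(block)
    return np.kron(np.eye(n // block), R)

L = np.linalg.cholesky(block_corr(n, block, rho))

def sample(gaussian):
    """Draw (X, y): block-correlated rows of X, logistic responses y."""
    Z = rng.normal(size=(n, p)) if gaussian else rng.choice([-1.0, 1.0], size=(n, p))
    X = L @ Z                                  # induces the row dependence
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta))).astype(int)
    return X, y

def avg_test_error(gaussian, reps=20):
    errs = []
    for _ in range(reps):
        X, y = sample(gaussian)                # dependent training set
        Xt, yt = sample(gaussian)              # fresh dependent test set
        clf = LogisticRegression(C=1.0, max_iter=5000).fit(X, y)
        errs.append(np.mean(clf.predict(Xt) != yt))
    return np.mean(errs)

# Universality predicts these two numbers converge to the same limit
# as n and p grow proportionally.
print(f"Gaussian design:   test error ~ {avg_test_error(True):.3f}")
print(f"Rademacher design: test error ~ {avg_test_error(False):.3f}")
```

The regularization is kept on deliberately: in the proportional regime the unregularized logistic MLE may fail to exist because the data can be linearly separable, so some form of penalization is needed for the risk to be well defined at these aspect ratios.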