Universality of High-Dimensional Logistic Regression and a Novel CGMT under Dependence with Applications to Data Augmentation

📅 2025-02-10
🏛️ Annual Conference on Computational Learning Theory
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper develops an exact asymptotic risk analysis of high-dimensional logistic regression under data dependence (specifically block-wise correlation, *m*-dependence, and hybrid structures), moving beyond the conventional independence assumption. Methodologically, it extends the Convex Gaussian Min-Max Theorem (CGMT) to settings where both covariates and responses exhibit dependence, combining dependent random matrix theory with mixed-process analysis to establish a novel CGMT-based analytical framework. The theoretical contributions are threefold: (1) a rigorous proof of Gaussian universality for high-dimensional logistic regression under dependent data; (2) the first asymptotic quantification of how data augmentation modifies generalization error, characterizing its precise risk-correction effect; and (3) the first provably sound high-dimensional statistical foundation for data augmentation strategies in deep learning.

📝 Abstract
Over the last decade, a wave of research has characterized the exact asymptotic risk of many high-dimensional models in the proportional regime. Two foundational results have driven this progress: Gaussian universality, which shows that the asymptotic risk of estimators trained on non-Gaussian and Gaussian data is equivalent, and the convex Gaussian min-max theorem (CGMT), which characterizes the risk under Gaussian settings. However, these results rely on the assumption that the data consists of independent random vectors, an assumption that significantly limits their applicability to many practical setups. In this paper, we address this limitation by generalizing both results to the dependent setting. More precisely, we prove that Gaussian universality still holds for high-dimensional logistic regression under block dependence, $m$-dependence and special cases of mixing, and establish a novel CGMT framework that accommodates correlation across both the covariates and observations. Using these results, we establish the impact of data augmentation, a widespread practice in deep learning, on the asymptotic risk.
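For background (this statement is the classical result of Thrampoulidis, Oymak, and Hassibi, not taken from the paper), the CGMT compares a primary Gaussian min-max problem with a simpler auxiliary one:

```latex
% Classical CGMT (independent Gaussian data); the paper extends this
% setting to dependent covariates and responses.
% Let G \in \mathbb{R}^{n \times p} have i.i.d. N(0,1) entries, and let
% g \in \mathbb{R}^n, h \in \mathbb{R}^p be independent standard Gaussian
% vectors. For compact sets S_w, S_u and continuous \psi, define
\Phi(G)   = \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u}
            \; u^\top G w + \psi(w, u), \qquad
\phi(g,h) = \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u}
            \; \|w\|_2 \, g^\top u + \|u\|_2 \, h^\top w + \psi(w, u).
% Then \mathbb{P}(\Phi(G) < c) \le 2\,\mathbb{P}(\phi(g,h) \le c) for all c,
% and when S_w, S_u are convex and \psi is convex-concave, the matching bound
% \mathbb{P}(\Phi(G) > c) \le 2\,\mathbb{P}(\phi(g,h) \ge c) also holds, so
% the auxiliary problem pins down the primary problem's asymptotics.
```

The auxiliary problem replaces the Gaussian matrix $G$ with two Gaussian vectors, which is what makes exact asymptotic risk formulas tractable; the paper's contribution is a version of this comparison that tolerates correlation across rows and columns.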
Problem

Research questions and friction points this paper is trying to address.

Extending Gaussian universality to dependent data settings
Developing a novel CGMT framework for correlated covariates and observations
Analyzing data augmentation impact on asymptotic risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized Gaussian universality for dependent data
Novel CGMT framework for correlated covariates and observations
Asymptotic characterization of data augmentation's impact on risk