🤖 AI Summary
Existing analyses of stochastic gradient descent (SGD) with tail-averaging in overparameterized linear regression rely heavily on operator-theoretic tools, in particular the manipulation of operators acting on positive semi-definite (PSD) matrices, which makes them opaque and difficult to extend.
Method: We propose a simplified, elementary linear algebra–based framework that circumvents operator-theoretic machinery entirely. Our analysis leverages only bias–variance decomposition and recursive dynamical modeling of SGD iterates.
Contribution/Results: The framework recovers the state-of-the-art bias and variance bounds for constant learning rate SGD, both with and without tail iterate averaging, without invoking higher-order operators, which makes the analysis easier to interpret and extend. It is intended as a concise foundation for further analysis of variants such as mini-batch SGD and learning rate scheduling.
📝 Abstract
Theoretically understanding stochastic gradient descent (SGD) in overparameterized models has led to the development of several optimization algorithms that are widely used in practice today. Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression with a constant learning rate, both with and without tail iterate averaging, based on a bias-variance decomposition of the risk. In our work, we provide a simplified analysis that recovers the same bias and variance bounds as Zou et al. (2021) using simple linear algebra tools, bypassing the need to manipulate operators on positive semi-definite (PSD) matrices. We believe our work makes the analysis of SGD on linear regression more accessible and will be helpful in further analyzing mini-batching and learning rate scheduling, leading to improvements in the training of realistic models.
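To make the setting concrete, the following is a minimal sketch of constant learning rate SGD with tail iterate averaging on a synthetic overparameterized linear regression problem. It is an illustration only, not code from the paper: the function names, the diagonal covariance with eigenvalues 1/i^2, the noise level, the step size, and the factor 1/2 in the excess risk are all assumptions chosen for the example.

```python
import numpy as np

def tail_averaged_sgd(X, y, step_size, tail_start):
    """One pass of constant-step-size SGD on the squared loss, starting at zero.

    Returns the average of the iterates w_t for t = tail_start, ..., n-1,
    where w_t is the iterate before the t-th sample is used.
    """
    n, d = X.shape
    w = np.zeros(d)
    tail_sum = np.zeros(d)
    for t in range(n):
        if t >= tail_start:
            tail_sum += w                      # accumulate w_t before updating
        residual = y[t] - X[t] @ w             # prediction error on sample t
        w = w + step_size * residual * X[t]    # single-sample gradient step
    return tail_sum / (n - tail_start)

def excess_risk(w, w_star, eigs):
    """Excess risk 0.5 * (w - w*)^T H (w - w*) for H = diag(eigs) (assumed convention)."""
    diff = w - w_star
    return 0.5 * np.sum(eigs * diff**2)

# Hypothetical problem instance: more dimensions than samples, decaying spectrum.
rng = np.random.default_rng(0)
d, n = 1000, 500
eigs = 1.0 / np.arange(1, d + 1) ** 2          # assumed eigenvalues of the covariance H
w_star = np.ones(d)                            # assumed ground-truth parameter
X = rng.standard_normal((n, d)) * np.sqrt(eigs)
y = X @ w_star + 0.1 * rng.standard_normal(n)  # well-specified model with Gaussian noise

w_avg = tail_averaged_sgd(X, y, step_size=0.1, tail_start=n // 2)
risk_init = excess_risk(np.zeros(d), w_star, eigs)
risk_avg = excess_risk(w_avg, w_star, eigs)
print(f"excess risk at initialization: {risk_init:.4f}")
print(f"excess risk of tail average:   {risk_avg:.4f}")
```

Setting `tail_start=n - 1` returns a single late iterate, which gives a rough way to compare the averaged and unaveraged cases mentioned in the abstract on the same data.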