AI Summary
Existing analyses of stochastic gradient descent (SGD) with tail-averaging in overparameterized linear regression rely heavily on advanced operator-theoretic tools, particularly for handling higher-order positive semidefinite (PSD) matrix operators, rendering them opaque and difficult to extend.
Method: We propose a simplified framework based on elementary linear algebra that circumvents operator-theoretic machinery entirely. Our analysis leverages only a bias-variance decomposition and recursive dynamical modeling of the SGD iterates.
Contribution/Results: The framework rigorously recovers state-of-the-art risk bounds without invoking high-order operators, significantly enhancing interpretability and extensibility. It provides a unified, concise, and easily generalizable theoretical foundation for key variants, including mini-batch SGD and adaptive learning rate schedules, thereby facilitating the extension of SGD theory to broader optimization settings beyond linear regression.
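The recursive dynamical modeling mentioned above can be sketched as follows; the notation ($w^*$, $\gamma$, $\eta_t$, $\xi_t$) is ours for illustration and may differ from the paper's:

```latex
% One SGD step on the squared loss for a sample (x_t, y_t), step size \gamma:
%   w_{t+1} = w_t - \gamma \,(x_t^\top w_t - y_t)\, x_t.
% Writing \eta_t = w_t - w^* and \xi_t = y_t - x_t^\top w^* (label noise),
% the centered iterates obey the linear recursion
\eta_{t+1} = \left( I - \gamma\, x_t x_t^\top \right) \eta_t + \gamma\, \xi_t\, x_t.
% Bias term: run the recursion with \xi_t \equiv 0 (contraction of \eta_0 only).
% Variance term: run it with \eta_0 = 0 (accumulation of noise only).
% The risk of the (tail-averaged) iterate then splits into these two parts.
```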
Abstract
Theoretically understanding stochastic gradient descent (SGD) in overparameterized models has led to the development of several optimization algorithms that are widely used in practice today. Recent work by~\citet{zou2021benign} provides sharp rates for SGD optimization in linear regression using a constant learning rate, both with and without tail iterate averaging, based on a bias-variance decomposition of the risk. In our work, we provide a simplified analysis recovering the same bias and variance bounds as~\citep{zou2021benign} using simple linear algebra tools, bypassing the need to manipulate operators on positive semi-definite (PSD) matrices. We believe our work makes the analysis of SGD on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling, leading to improvements in the training of realistic models.
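For readers unfamiliar with the setting, here is a minimal sketch of constant-step-size SGD on linear regression with tail iterate averaging. The problem setup (dimension, noise level, step size, isotropic Gaussian data) is illustrative only and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, gamma = 20, 2000, 0.01          # dimension, steps, constant step size
w_star = rng.standard_normal(d) / np.sqrt(d)  # ground-truth parameter

w = np.zeros(d)
iterates = []
for t in range(n):
    x = rng.standard_normal(d)
    y = x @ w_star + 0.1 * rng.standard_normal()  # noisy label
    w = w - gamma * (x @ w - y) * x               # one SGD step on squared loss
    iterates.append(w.copy())

# Tail averaging: average only the last half of the iterates, discarding
# the early burn-in phase where the bias (initialization error) dominates.
w_tail = np.mean(iterates[n // 2:], axis=0)

def risk(v):
    # Excess risk; equals the parameter error under isotropic data.
    return float(np.sum((v - w_star) ** 2))

print(risk(iterates[-1]), risk(w_tail))
```

Averaging the tail iterates suppresses the variance introduced by the constant learning rate while leaving the (already decayed) bias essentially untouched, which is exactly the trade-off the bias-variance decomposition makes precise.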