🤖 AI Summary
This work studies the generalization behavior of overparameterized ridge regression for high-dimensional binary classification on anisotropic, cluster-structured data under label-flip noise. Leveraging high-dimensional probability and anisotropic random matrix theory, we characterize how the classification error depends jointly on the magnitude of the cluster mean and the effective rank of the covariance tail. We establish precise conditions for “benign overfitting” in binary classification via ridge regression and show that, when the cluster-mean scale is large, these conditions coincide with those known for linear regression. Furthermore, we show that while label noise shifts the geometry of the minimum-norm interpolating solution, it preserves that solution’s qualitative generalization behavior. Our theoretical framework quantitatively disentangles the bias mechanism induced by noise and delineates sharp generalization boundaries, offering a new perspective on the robustness of linear classifiers operating on structured, high-dimensional data.
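For reference, a standard formalization of this data model and of “effective rank in the tail” (following the benign-overfitting literature; the paper’s exact assumptions may differ) is:

$$
x = y\,\mu + q, \qquad q \sim \mathcal{N}(0, \Sigma), \qquad \mathbb{P}(\tilde{y} \neq y) = \eta,
$$

where $y \in \{\pm 1\}$ is the clean label, $\tilde{y}$ is the observed (possibly flipped) training label, and the tail effective ranks of $\Sigma$, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots$, are

$$
r_k(\Sigma) = \frac{\sum_{i>k} \lambda_i}{\lambda_{k+1}}, \qquad
R_k(\Sigma) = \frac{\big(\sum_{i>k} \lambda_i\big)^2}{\sum_{i>k} \lambda_i^2}.
$$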
📝 Abstract
In this work, we investigate the behavior of ridge regression in an overparameterized binary classification task. We assume examples are drawn from (anisotropic) class-conditional cluster distributions with opposing means, and we allow the training labels to be corrupted by a constant rate of label-flipping noise. We characterize the classification error achieved by ridge regression under the assumption that the covariance matrix of the cluster distribution has a high effective rank in the tail. We show that ridge regression behaves qualitatively differently depending on the scale of the cluster mean vector and its interaction with the covariance matrix of the cluster distributions. In regimes where this scale is very large, the conditions that allow for benign overfitting turn out to be the same as those for the regression task. We additionally provide insights into how the introduction of label noise affects the behavior of the minimum norm interpolator (MNI). The optimal classifier in this setting is a linear transformation of the cluster mean vector, and in the noiseless setting the MNI approximately learns this transformation. On the other hand, the introduction of label noise can significantly change the geometry of the solution while preserving the same qualitative behavior.
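As a concrete illustration (not the paper’s experiments), the numpy sketch below simulates one plausible instance of this setup: Gaussian clusters with opposing means ±μ, a diagonal anisotropic covariance with a heavy flat tail, a constant label-flip rate η, and ridge regression whose λ → 0 limit recovers the MNI. All parameter choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 1000          # overparameterized regime: d >> n
eta = 0.1                 # constant label-flip probability (assumed value)

mu = np.zeros(d)
mu[0] = 5.0               # cluster-mean scale, placed along the spike direction

# Anisotropic diagonal covariance: one large spike plus a flat tail,
# so the tail of the spectrum has high effective rank.
spectrum = np.ones(d)
spectrum[0] = 10.0
sqrt_spec = np.sqrt(spectrum)

def sample(m):
    """Draw m examples x = y * mu + q with q ~ N(0, diag(spectrum))."""
    y = rng.choice([-1.0, 1.0], size=m)
    x = y[:, None] * mu + rng.standard_normal((m, d)) * sqrt_spec
    return x, y

X, y_clean = sample(n)
flips = rng.random(n) < eta
y_train = np.where(flips, -y_clean, y_clean)  # noisy training labels

# Ridge regression on +/-1 labels via the kernel form; since d > n,
# taking ridge_lambda -> 0 recovers the minimum-norm interpolator (MNI).
ridge_lambda = 1e-8
alpha = np.linalg.solve(X @ X.T + ridge_lambda * np.eye(n), y_train)
w = X.T @ alpha

# Classification error is measured against the clean test labels.
X_test, y_test = sample(5000)
err = np.mean(np.sign(X_test @ w) != y_test)
print(f"test classification error: {err:.3f}")
```

Varying `mu[0]`, `eta`, and the shape of `spectrum` in this sketch is one way to probe the regimes the abstract describes, e.g. how a larger cluster-mean scale or a heavier covariance tail changes the error of the near-interpolating solution.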