🤖 AI Summary
This work investigates whether standard neural networks trained by gradient descent can learn almost-full parity functions, where $k = d - O(1)$, and how sensitive the answer is to the weight initialization. Theoretically, the discrete Rademacher initialization enables efficient learning, and this positive result remains robust under Gaussian perturbations of standard deviation up to $\sigma = O(d^{-1})$, whereas a perturbation with a large enough constant $\sigma$ provably prevents learning, pointing to a threshold phenomenon in initialization robustness. The authors introduce an analytical measure of *initial gradient alignment* to study the gradient descent dynamics, proving successful learning under Rademacher initialization and delineating the regime of $\sigma$ in which it persists. Beyond giving evidence of hardness for a fixed target function (distinct from statistical query lower bounds, under which a singleton function class is trivially learnable), the work highlights the role of discrete (e.g., Rademacher) initialization in learning high-degree Boolean functions and offers theoretical insight into implicit bias in deep learning.
📝 Abstract
Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree-$k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), including the degree-$d$ parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities; on the other hand, its Gaussian perturbation with large enough constant standard deviation $\sigma$ prevents it. The positive result for almost-full parities is shown to hold up to $\sigma=O(d^{-1})$, pointing to questions about a sharper threshold phenomenon. Unlike statistical query (SQ) learning, where a singleton function class like the full parity is trivially learnable, our negative result applies to a fixed function and relies on an initial gradient alignment measure of potential broader relevance to neural network learning.
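As a minimal illustration (not the authors' construction), the objects named in the abstract — an almost-full parity target on uniform Boolean inputs, a Rademacher weight initialization, and its Gaussian perturbation with standard deviation $\sigma$ — can be sketched as follows; the dimensions, hidden width, and the choice $\sigma = 1/d$ are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 20, 18  # ambient dimension; almost-full parity of degree k = d - 2

def parity(x, S):
    """Degree-|S| parity chi_S(x) = prod_{i in S} x_i for x in {-1,+1}^d."""
    return np.prod(x[..., S], axis=-1)

# Uniform inputs over the Boolean cube {-1,+1}^d
X = rng.choice([-1.0, 1.0], size=(1000, d))
S = np.arange(k)  # support of the parity (here: the first k coordinates)
y = parity(X, S)  # labels in {-1,+1}

# Rademacher initialization: each first-layer weight is +1 or -1 uniformly
W_rad = rng.choice([-1.0, 1.0], size=(64, d))

# Gaussian-perturbed variant: Rademacher plus N(0, sigma^2) noise;
# the paper's positive result is shown to tolerate sigma up to O(1/d)
sigma = 1.0 / d
W_pert = W_rad + sigma * rng.normal(size=W_rad.shape)
```

Flipping any single coordinate inside `S` flips the label, which is what makes high-degree parities a stringent benchmark for gradient-based learners.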