🤖 AI Summary
This work establishes the inherent difficulty of learning a fixed parity function, such as the full-coordinate XOR, with one-hidden-layer ReLU networks trained by perturbed gradient descent. Classical statistical query lower bounds only guarantee that for each algorithm *some* worst-case parity is hard; here, the authors rigorously prove that for any fixed parity over a sufficiently large set of coordinates, this standard training pipeline fails to produce anything meaningful. Methodologically, the proof rests on a new decay bound on the Fourier coefficients of linear threshold (weighted majority) functions, a result of independent interest that bridges Fourier analysis and computational learning theory. The upshot is that the observed failure is not an artifact of a particular algorithm's shortcomings but reflects a genuine tension between what ReLU networks can represent and what gradient-based optimization can find, offering a new theoretical lens on the gap between expressivity and trainability in deep learning.
📝 Abstract
Learning parity functions is a canonical problem in learning theory which, although computationally tractable, is not amenable to standard learning algorithms such as gradient-based methods. This hardness is usually explained via statistical query lower bounds [Kearns, 1998]. However, these bounds only imply that for any given algorithm, there is some worst-case parity function that will be hard to learn. Thus, they do not explain why fixed parities - say, the full parity function over all coordinates - are difficult to learn in practice, at least with standard predictors and gradient-based methods [Abbe and Boix-Adsera, 2022]. In this paper, we address this open problem by showing that for any fixed parity of some minimal size, using it as a target function to train one-hidden-layer ReLU networks with perturbed gradient descent will fail to produce anything meaningful. To establish this, we prove a new result about the decay of the Fourier coefficients of linear threshold (or weighted majority) functions, which may be of independent interest.
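To get a feel for the Fourier-analytic mechanism, one can check numerically that a basic linear threshold function, the unweighted majority, has a Fourier coefficient on the full coordinate set (equivalently, a correlation with the full parity) that shrinks as the dimension grows. This is only a small illustration of the flavor of the decay phenomenon, not the paper's actual bound, and the function names below are ours.

```python
import itertools

def majority(x):
    # A basic linear threshold function: the sign of the coordinate sum
    # (n is kept odd so the sum is never zero).
    return 1 if sum(x) > 0 else -1

def full_set_coeff(n):
    # Fourier coefficient of majority on the full coordinate set:
    # E_x[ majority(x) * x_1 * ... * x_n ] over uniform x in {-1, +1}^n,
    # i.e. the correlation between majority and the full parity.
    total = 0
    for x in itertools.product((-1, 1), repeat=n):
        chi = 1
        for xi in x:
            chi *= xi
        total += majority(x) * chi
    return total / 2 ** n

for n in (3, 5, 7, 9, 11):
    print(n, abs(full_set_coeff(n)))
```

Running this prints one magnitude per dimension, decreasing from 0.5 at n = 3; loosely speaking, this vanishing correlation between threshold-type units and the target parity is what starves gradient-based training of useful signal.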