🤖 AI Summary
This paper studies nonparametric regression of low-degree spherical harmonics (degree ℓ₀ = Θ(1)) on the unit sphere. We propose an overparameterized two-layer neural network equipped with learnable channel-wise attention and design a two-stage gradient descent algorithm: in Stage I, a single joint update on both layers identifies the true degree ℓ₀ exactly by pruning from an initial overparameterized width L ≥ ℓ₀; in Stage II, the activations are frozen and standard optimization proceeds. Leveraging spherical harmonic analysis and nonparametric statistical theory, we prove that the method achieves sample complexity Θ(d^{ℓ₀}/ε), matching the minimax optimal rate; moreover, with probability at least 1−δ, the regression risk converges at the rate Θ(d^{ℓ₀}/n). This result substantially improves upon prior work and is information-theoretically unimprovable. To our knowledge, this is the first theoretical demonstration of attention-driven neural networks achieving minimax-optimal sample efficiency for spherical nonparametric regression.
📝 Abstract
We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\mathbb{R}^d$ by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\epsilon \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width $m \ge \Theta\left( n^4 \log(2n/\delta) / d^{2\ell_0} \right)$ trained by vanilla gradient descent (GD) requires a sample complexity of only $n \asymp \Theta(d^{\ell_0}/\epsilon)$ with probability $1-\delta$ for every $\delta \in (0,1)$, in contrast with the representative sample complexity $\Theta\left( d^{\ell_0} \max\{\epsilon^{-2}, \log d\} \right)$ in prior work, where $n$ is the training data size. Moreover, this sample complexity is not improvable, since the trained network attains a sharp nonparametric regression risk of order $\Theta(d^{\ell_0}/n)$ with probability at least $1-\delta$. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/n)$, so the rate achieved by the GD-trained network is minimax optimal. The training of the two-layer NN with channel attention consists of two stages. In Stage 1, a provable learnable channel selection algorithm identifies the ground-truth channel number $\ell_0$ from the initial $L \ge \ell_0$ channels in the first-layer activation, with high probability. This learnable selection is achieved by an efficient one-step GD update on both layers, enabling feature learning for low-degree polynomial targets. In Stage 2, the second layer is trained by standard GD using the activation function with the selected channels.
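To make the two-stage procedure concrete, the following is a minimal NumPy sketch under stated assumptions, not the paper's exact construction: the $L$ attention channels are stood in by hypothetical power activations $s \mapsto s^{l+1}$ (so channel $l$ represents degree $l+1$), the first layer is frozen at random initialization, and channel selection picks the channel whose attention gradient is largest in magnitude after one joint GD step. The function `two_stage_attention_gd` and all its parameters are illustrative names, not from the paper.

```python
import numpy as np

def two_stage_attention_gd(X, y, L=4, m=200, lr1=0.1, steps=3000, seed=0):
    """Toy two-stage training for a two-layer net with channel attention.

    Hypothetical stand-in for the paper's construction: channel l applies
    the power activation s -> s**(l+1), so channel l represents degree l+1.
    Stage 1: one joint GD step on the second-layer weights and the channel-
    attention weights, then keep the channel whose attention gradient has
    the largest magnitude. Stage 2: freeze the selected activation channel
    and train the second layer by vanilla GD.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(m, d)) / np.sqrt(d)            # frozen first layer
    S = X @ W.T                                         # (n, m) pre-activations
    feats = np.stack([S ** (l + 1) for l in range(L)])  # (L, n, m) channels

    a = np.zeros(m)               # second-layer weights
    alpha = np.full(L, 1.0 / L)   # channel-attention weights

    # ---- Stage 1: one joint GD step, then channel selection ----
    resid = np.einsum('l,lnm,m->n', alpha, feats, a) - y
    a = a - lr1 * np.einsum('l,lnm,n->m', alpha, feats, resid) / n
    # attention gradient at the updated weights; its magnitude scores channels
    resid = np.einsum('l,lnm,m->n', alpha, feats, a) - y
    score = np.abs(np.einsum('lnm,m,n->l', feats, a, resid)) / n
    l0 = int(np.argmax(score))    # selected channel (represents degree l0 + 1)

    # ---- Stage 2: vanilla GD on the second layer, selected channel only ----
    phi = feats[l0]                                     # (n, m) frozen features
    # power iteration estimates lambda_max(phi^T phi / n) for a stable step size
    v = rng.normal(size=m)
    for _ in range(50):
        v = phi.T @ (phi @ v) / n
        v /= np.linalg.norm(v)
    lr2 = 1.0 / (v @ (phi.T @ (phi @ v)) / n)
    a = np.zeros(m)
    for _ in range(steps):
        a -= lr2 * phi.T @ (phi @ a - y) / n            # GD on squared loss
    mse = np.mean((phi @ a - y) ** 2)
    return l0, mse
```

On a synthetic degree-2 target $y = (u^\top x)^2$ with $x$ on the unit sphere, Stage 1 of this sketch selects the quadratic channel ($l_0 = 1$) and Stage 2 drives the training risk far below the trivial baseline, mirroring the selection-then-fitting narrative above at toy scale.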