🤖 AI Summary
This paper studies nonparametric regression of low-degree spherical harmonics (degree ℓ₀ = Θ(1)) on the unit sphere. We propose an overparameterized two-layer neural network equipped with learnable channel-wise attention and design a two-stage gradient descent algorithm: in Stage I, a single joint update on both layers identifies the true degree ℓ₀ exactly by pruning from an initial overparameterized width L ≥ ℓ₀; in Stage II, the activations are frozen and standard optimization proceeds. Leveraging spherical harmonic analysis and nonparametric statistical theory, we prove that the method achieves sample complexity Θ(d^{ℓ₀}/ε), matching the minimax optimal rate; moreover, with probability at least 1−δ, the regression risk converges at the rate Θ(d^{ℓ₀}/n). This result substantially improves upon prior work and is information-theoretically unimprovable. To our knowledge, this is the first theoretical demonstration of attention-driven neural networks achieving minimax-optimal sample efficiency for spherical nonparametric regression.
📝 Abstract
We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\mathbb{R}^d$ by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\epsilon \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width $m \ge \Theta\left( n^4 \log(2n/\delta) / d^{2\ell_0} \right)$ trained by vanilla gradient descent (GD) requires a sample complexity of only $n \asymp \Theta(d^{\ell_0}/\epsilon)$ with probability $1-\delta$ for every $\delta \in (0,1)$, in contrast with the representative sample complexity $\Theta\left( d^{\ell_0} \max\{\epsilon^{-2}, \log d\} \right)$ in prior work, where $n$ is the training data size. Moreover, this sample complexity is not improvable, since the trained network attains a sharp nonparametric regression risk of order $\Theta(d^{\ell_0}/n)$ with probability at least $1-\delta$. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/n)$, so the rate achieved by the GD-trained network is minimax optimal. The training of the two-layer NN with channel attention consists of two stages. In Stage 1, a provable learnable channel selection algorithm identifies the ground-truth channel number $\ell_0$ from the initial $L \ge \ell_0$ channels in the first-layer activation, with high probability. This learnable selection is achieved by an efficient one-step GD update on both layers, enabling feature learning for low-degree polynomial targets. In Stage 2, the second layer is trained by standard GD using the activation function with the selected channels.
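To make the two-stage procedure concrete, the following is a minimal NumPy sketch under stated assumptions, not the paper's exact construction: the $L$ attention channels are stood in by hypothetical power activations $s \mapsto s^{l+1}$ (so channel $l$ represents degree $l+1$), the first layer is frozen at random initialization, and channel selection picks the channel whose attention gradient is largest in magnitude after one joint GD step. The function `two_stage_attention_gd` and all its parameters are illustrative names, not from the paper.

```python
import numpy as np

def two_stage_attention_gd(X, y, L=4, m=200, lr1=0.1, steps=3000, seed=0):
    """Toy two-stage training for a two-layer net with channel attention.

    Hypothetical stand-in for the paper's construction: channel l applies
    the power activation s -> s**(l+1), so channel l represents degree l+1.
    Stage 1: one joint GD step on the second-layer weights and the channel-
    attention weights, then keep the channel whose attention gradient has
    the largest magnitude. Stage 2: freeze the selected activation channel
    and train the second layer by vanilla GD.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(m, d)) / np.sqrt(d)            # frozen first layer
    S = X @ W.T                                         # (n, m) pre-activations
    feats = np.stack([S ** (l + 1) for l in range(L)])  # (L, n, m) channels

    a = np.zeros(m)               # second-layer weights
    alpha = np.full(L, 1.0 / L)   # channel-attention weights

    # ---- Stage 1: one joint GD step, then channel selection ----
    resid = np.einsum('l,lnm,m->n', alpha, feats, a) - y
    a = a - lr1 * np.einsum('l,lnm,n->m', alpha, feats, resid) / n
    # attention gradient at the updated weights; its magnitude scores channels
    resid = np.einsum('l,lnm,m->n', alpha, feats, a) - y
    score = np.abs(np.einsum('lnm,m,n->l', feats, a, resid)) / n
    l0 = int(np.argmax(score))    # selected channel (represents degree l0 + 1)

    # ---- Stage 2: vanilla GD on the second layer, selected channel only ----
    phi = feats[l0]                                     # (n, m) frozen features
    # power iteration estimates lambda_max(phi^T phi / n) for a stable step size
    v = rng.normal(size=m)
    for _ in range(50):
        v = phi.T @ (phi @ v) / n
        v /= np.linalg.norm(v)
    lr2 = 1.0 / (v @ (phi.T @ (phi @ v)) / n)
    a = np.zeros(m)
    for _ in range(steps):
        a -= lr2 * phi.T @ (phi @ a - y) / n            # GD on squared loss
    mse = np.mean((phi @ a - y) ** 2)
    return l0, mse
```

On a synthetic degree-2 target $y = (u^\top x)^2$ with $x$ on the unit sphere, Stage 1 of this sketch selects the quadratic channel ($l_0 = 1$) and Stage 2 drives the training risk far below the trivial baseline, mirroring the selection-then-fitting narrative above at toy scale.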