Average Gradient Outer Product in kernel regression provably recovers the central subspace for multi-index models

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of recovering, from limited samples, the unknown low-dimensional central subspace of a multi-index model, i.e., the subspace that contains all predictive information. The method fits kernel ridge regression (KRR) to the data and estimates the central subspace by the leading $r$-dimensional eigenspace of the average gradient outer product (AGOP) of the fitted function. The analysis reveals a sample-complexity separation between representation learning and prediction: if a degree-$p$ component of the target already carries all relevant directions, the $r$-dimensional central subspace is recovered with high probability from $n \asymp d^{p+\delta}$ samples for any $\delta\in(0,1)$, far fewer than the $n \asymp d^{p^*}$ samples that KRR needs to predict a degree-$p^*$ target accurately. This finding shows that efficient representation learning is feasible in low-sample regimes where prediction error is still large.
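To see why the AGOP can expose the central subspace at all, the chain rule applied to a multi-index function $f(x)=h(Ux)$ with $U\in\mathbb{R}^{r\times d}$ gives the identity below. This is a standard calculation rather than a statement taken from the paper, and the empirical average over sample points $x_1,\dots,x_n$ is an assumed normalization that may differ from the authors' definition.

$$
\nabla f(x) = U^\top \nabla h(Ux)
\quad\Longrightarrow\quad
\mathrm{AGOP}(f) := \frac{1}{n}\sum_{i=1}^{n} \nabla f(x_i)\,\nabla f(x_i)^\top
= U^\top \Big( \frac{1}{n}\sum_{i=1}^{n} \nabla h(Ux_i)\,\nabla h(Ux_i)^\top \Big)\, U .
$$

The range of $\mathrm{AGOP}(f)$ therefore lies in the row space of $U$, and it equals that space whenever the inner $r\times r$ matrix is nonsingular. A fitted predictor only approximately has the form $h(Ux)$, which is why the estimate uses the top $r$ eigenvectors rather than the exact range.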

📝 Abstract

We study a prototypical situation when a learned predictor can discover useful low-dimensional structure in data, while using fewer samples than are needed for accurate prediction. Specifically, we consider the problem of recovering a multi-index polynomial $f^*(x)=h(Ux)$, with $U\in\mathbb{R}^{r\times d}$ and $r\ll d$, from finitely many data/label pairs. Importantly, the target function depends on input $x$ only through the projection onto an unknown $r$-dimensional central subspace. The algorithm we analyze is appealingly simple: fit kernel ridge regression (KRR) to the data and compute the Average Gradient Outer Product (AGOP) from the fitted predictor. Our main results show that under reasonable assumptions the top $r$-dimensional eigenspace of AGOP provably recovers the central subspace, even in regimes when the prediction error remains large. Specifically, if the target function $f^*$ has degree $p^*$, it is known that $n\asymp d^{p^*}$ samples are necessary for KRR to achieve accurate prediction. In contrast, we show that if a low-degree-$p$ component of $f^*$ already carries all relevant directions for prediction, subspace recovery occurs in the much lower sample regime $n\asymp d^{p+\delta}$ for any $\delta\in(0,1)$. Our results thus demonstrate a separation between prediction and representation, and provide an explanation for why iterative kernel methods such as Recursive Feature Machines (RFM) can be sample-efficient in practice.
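The two-step procedure described in the abstract (fit KRR, then take the top eigenspace of the AGOP) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions that do not come from the paper: a Gaussian kernel with bandwidth `s2`, a fixed ridge parameter, a toy target $h(z)=z_1+z_1z_2$, Gaussian inputs, and the helper names `gaussian_kernel` and `krr_agop_subspace` are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_kernel(A, B, s2):
    # k(a, b) = exp(-||a - b||^2 / (2 * s2)); assumed kernel, not the paper's.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * s2))

def krr_agop_subspace(X, y, r, s2, ridge):
    """Fit KRR, then return the top-r eigenvectors of the AGOP and the AGOP."""
    n, d = X.shape
    K = gaussian_kernel(X, X, s2)
    alpha = np.linalg.solve(K + ridge * np.eye(n), y)  # KRR coefficients
    # Gradient of the fitted function at training point x_j:
    #   (1 / s2) * sum_i alpha_i * k(x_j, x_i) * (x_i - x_j)
    W = K * alpha[None, :]
    grads = (W @ X - W.sum(axis=1, keepdims=True) * X) / s2
    agop = grads.T @ grads / n            # average gradient outer product
    _, evecs = np.linalg.eigh(agop)       # eigenvalues in ascending order
    return evecs[:, -r:], agop            # last r columns = top-r eigenspace

# Toy multi-index data (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
d, r, n = 10, 2, 800
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
U = Q[:, :r].T                            # r x d with orthonormal rows
X = rng.standard_normal((n, d))
Z = X @ U.T
y = Z[:, 0] + Z[:, 0] * Z[:, 1]           # f*(x) = h(Ux), h(z) = z1 + z1*z2

V, agop = krr_agop_subspace(X, y, r, s2=float(d), ridge=1e-3)

# Sine of the largest principal angle between span(V) and the row space of U.
sin_theta = np.linalg.norm((np.eye(d) - U.T @ U) @ V, 2)
print(f"sin of largest principal angle: {sin_theta:.3f}")
```

A value of `sin_theta` near 0 means the estimated eigenspace nearly coincides with the central subspace, and a value near 1 means at least one direction was missed. The sketch does not attempt to reproduce the sample regimes analyzed in the paper; it only shows the mechanics of the estimator.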

Problem

Research questions and friction points this paper is trying to address.

central subspace

multi-index models

subspace recovery

sample efficiency

kernel regression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Average Gradient Outer Product

central subspace recovery

kernel ridge regression