🤖 AI Summary
This work addresses the lack of systematic theoretical understanding of Kolmogorov–Arnold Networks (KANs) regarding their training dynamics, generalization behavior, and differential privacy (DP) properties. Focusing on two-layer KANs trained via gradient descent, we establish convergence and generalization bounds under polylogarithmic width assumptions by leveraging the separability of the Neural Tangent Kernel (NTK). Notably, we prove that this width condition is not only sufficient but also necessary under $(\varepsilon,\delta)$-DP constraints, revealing a fundamental distinction between private and non-private training regimes. Our theory shows that under logistic loss, the optimization rate is $\mathcal{O}(1/T)$ and the generalization error scales as $\mathcal{O}(1/n)$. Under $(\varepsilon,\delta)$-DP, the utility loss is bounded by $\mathcal{O}(\sqrt{d}/(n\varepsilon))$, matching the classical lower bound for convex Lipschitz problems. Experiments corroborate the practical relevance of our theoretical findings.
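Collected in one place, the three rates stated above can be written as follows (a notational sketch; the symbols $L$, $\mathcal{E}_{\mathrm{gen}}$, and $\mathcal{E}_{\mathrm{priv}}$ are illustrative shorthand for the optimization error, generalization error, and DP utility loss, not the paper's own notation):

```latex
\begin{aligned}
\text{optimization (logistic loss, $T$ GD steps):}\quad
  & L_T - L^\star \;=\; \mathcal{O}\!\left(\tfrac{1}{T}\right),\\[2pt]
\text{generalization ($n$ samples):}\quad
  & \mathcal{E}_{\mathrm{gen}} \;=\; \mathcal{O}\!\left(\tfrac{1}{n}\right),\\[2pt]
\text{$(\varepsilon,\delta)$-DP utility ($d$ input dimensions):}\quad
  & \mathcal{E}_{\mathrm{priv}} \;=\; \mathcal{O}\!\left(\tfrac{\sqrt{d}}{n\varepsilon}\right).
\end{aligned}
```

The last rate matches the known lower bound for convex Lipschitz problems, which is why the paper describes it as tight.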
📝 Abstract
Kolmogorov--Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, with $T$ denoting the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\varepsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\varepsilon)$ (with $d$ the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between the non-private regime, where the width condition is merely sufficient, and the private regime, where it also becomes necessary. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.
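To make the private-training setup concrete, the following is a minimal sketch of $(\varepsilon,\delta)$-DP full-batch gradient descent on logistic loss: per-example gradients are clipped in $\ell_2$ norm and Gaussian noise is added to the averaged gradient. The noise calibration below uses the standard Gaussian mechanism composed naively over $T$ steps; it is illustrative only and is not the paper's noise characterization (a tighter accountant would inject less noise), and the linear model stands in for the two-layer KAN analyzed in the paper.

```python
import numpy as np

def dp_gd_logistic(X, y, T=30, lr=0.5, clip=1.0, eps=1.0, delta=1e-5, seed=0):
    """(eps, delta)-DP gradient descent for logistic loss on labels y in {-1, +1}.

    Illustrative sketch: per-example gradient clipping + Gaussian noise,
    with the noise scale from basic composition of the Gaussian mechanism
    over T gradient releases. Not the paper's exact mechanism.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    # Gaussian-mechanism std for sensitivity clip/n, composed over T steps.
    sigma = clip * np.sqrt(2.0 * T * np.log(1.25 / delta)) / (n * eps)
    for _ in range(T):
        margins = y * (X @ w)
        coef = -y / (1.0 + np.exp(margins))        # gradient of log(1 + e^{-y w.x})
        grads = coef[:, None] * X                  # per-example gradients, shape (n, d)
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads *= np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # L2 clipping
        g = grads.mean(axis=0) + rng.normal(0.0, sigma, size=d)    # noisy release
        w -= lr * g
    return w
```

Note how the injected noise scale $\sigma \propto \sqrt{d}$ enters the iterate error, which is the mechanism behind utility bounds of the form $\sqrt{d}/(n\varepsilon)$.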