Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional gradient clipping: it ignores the matrix structure of neural network parameters and therefore cannot target the few leading gradient singular values that heavy-tailed noise amplifies. The authors propose spectral clipping, which generalizes clipping from vector norms to the matrix spectral domain by clamping excessively large singular values while preserving the singular directions. The method combines layer-wise adaptive thresholds (moving averages or sliding-window quantiles of the top singular values) with efficient randomized truncated SVD, and comes with a convergence analysis for non-convex optimization. Experiments on synthetic heavy-tailed problems and neural network training tasks show competitive performance.
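As a rough illustration of the clamping step described above, the following NumPy sketch clips the singular values of a single gradient matrix with a full SVD. The function name `spectral_clip`, the fixed threshold `tau`, and the synthetic rank-one "outlier" are assumptions made for this example; they are not taken from the paper's implementation.

```python
import numpy as np

def spectral_clip(grad, tau):
    """Clamp the singular values of a 2-D gradient at tau.

    The singular vectors (U, Vt) are left untouched; only the
    spectrum is modified. `tau` is a hypothetical fixed threshold.
    """
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    s_clipped = np.minimum(s, tau)   # sigma_i -> min(sigma_i, tau)
    return (U * s_clipped) @ Vt      # U @ diag(s_clipped) @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))

# Simulate an outlier that inflates one leading singular value.
u = rng.standard_normal((6, 1))
v = rng.standard_normal((1, 4))
G_out = G + 50.0 * (u @ v)

G_clipped = spectral_clip(G_out, tau=3.0)
```

Unlike norm clipping, which rescales the whole gradient, this only shrinks the components whose singular values exceed `tau`; singular values already below the threshold are unchanged.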

📝 Abstract

Gradient clipping is a standard safeguard for training neural networks under noisy, heavy-tailed stochastic gradients; yet, most clipping rules treat all parameters as vectors and ignore the matrix structure of modern architectures. We show empirically that data outliers often amplify only a small number of leading singular values in layer-wise gradient matrices, while the rest of the spectrum remains largely unchanged. Motivated by this phenomenon, we propose spectral clipping, which stabilizes training by clamping singular values that exceed a threshold while preserving the singular directions. This framework generalizes classical gradient norm clipping and can be easily integrated into existing optimizers. We provide a convergence analysis for non-convex optimization with spectrally clipped SGD, yielding the optimal $\mathcal{O}\left(K^{\frac{2 - 2\alpha}{3\alpha - 2}}\right)$ rate for heavy-tailed noise. To minimize hyperparameter tuning, we introduce layer-wise adaptive thresholds based on moving averages or sliding-window quantiles of the top singular values. Finally, we develop efficient implementations that clip only the top $r$ singular values via randomized truncated SVD, avoiding full decompositions for large layers. We demonstrate competitive performance across synthetic heavy-tailed settings and neural network training tasks.
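The last two ideas in the abstract, clipping only the top $r$ singular values and setting the threshold from a sliding-window quantile, can be sketched together in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`randomized_top_svd`, `spectral_clip_top_r`, `QuantileThreshold`), the oversampling and power-iteration counts, the window size, and the quantile level are all hypothetical choices, and the randomized SVD is a generic range-finder scheme.

```python
import numpy as np
from collections import deque

def randomized_top_svd(G, r, oversample=5, n_iter=2, rng=None):
    """Approximate top-r SVD via a randomized range finder (generic scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    k = min(r + oversample, min(G.shape))
    Q, _ = np.linalg.qr(G @ rng.standard_normal((G.shape[1], k)))
    for _ in range(n_iter):              # power iterations sharpen the basis
        Q, _ = np.linalg.qr(G.T @ Q)
        Q, _ = np.linalg.qr(G @ Q)
    Ub, s, Vt = np.linalg.svd(Q.T @ G, full_matrices=False)
    return (Q @ Ub)[:, :r], s[:r], Vt[:r]

def spectral_clip_top_r(G, tau, r, rng=None):
    """Subtract only the excess of the top-r singular values above tau."""
    U, s, Vt = randomized_top_svd(G, r, rng=rng)
    excess = np.maximum(s - tau, 0.0)    # zero wherever sigma_i <= tau
    return G - (U * excess) @ Vt, s[0]   # also return the estimated top sigma

class QuantileThreshold:
    """Per-layer threshold from a sliding window of past top singular values."""
    def __init__(self, window=20, q=0.9):  # hypothetical defaults
        self.history = deque(maxlen=window)
        self.q = q

    def threshold(self):
        if not self.history:
            return np.inf                # no history yet: do not clip
        return float(np.quantile(list(self.history), self.q))

    def observe(self, top_sigma):
        self.history.append(float(top_sigma))

rng = np.random.default_rng(0)
thr = QuantileThreshold(window=20, q=0.9)
for step in range(30):
    G = rng.standard_normal((40, 30))    # stand-in for a layer gradient
    if step == 29:                       # inject one heavy outlier at the end
        G = G + 20.0 * rng.standard_normal((40, 1)) @ rng.standard_normal((1, 30))
    tau = thr.threshold()
    G_clipped, top = spectral_clip_top_r(G, tau, r=2, rng=rng)
    thr.observe(top)
```

Here the threshold is read before the current step's top singular value is added to the window, so a single outlier cannot raise its own threshold. Because the correction is a rank-$r$ update, the full decomposition of the layer gradient is never formed.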

Problem

Research questions and friction points this paper is trying to address.

gradient clipping

matrix-valued parameters

singular values

heavy-tailed noise

spectral structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

spectral clipping

matrix-valued gradients

singular value thresholding