Transformers Learn Low Sensitivity Functions: Investigations and Implications

📅 2024-03-11
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates Transformer sensitivity to input token-level random perturbations to characterize its fundamental inductive bias. Method: We quantify sensitivity via perturbation analysis, Neural Tangent Kernel (NTK) spectral decomposition, adversarial evaluation, and loss landscape geometry modeling—across diverse vision and language tasks. Contribution/Results: We discover that Transformers exhibit significantly lower perturbation sensitivity than MLPs, CNNs, ConvMixers, and LSTMs, establishing “low sensitivity” as a cross-modal, unified inductive bias. This property empirically correlates with enhanced robustness, flatter minima, and grokking dynamics—and enables retraining-free robustness improvement. Theoretically, we provide a spectral bias explanation within the NTK framework; empirically, we demonstrate strong negative correlations between sensitivity and both generalization performance and optimization convergence speed.

Technology Category

Application Category

📝 Abstract
Transformers achieve state-of-the-art accuracy and robustness across many tasks, but an understanding of their inductive biases and how those biases differ from other neural network architectures remains elusive. In this work, we identify the sensitivity of the model to token-wise random perturbations in the input as a unified metric which explains the inductive bias of transformers across different data modalities and distinguishes them from other architectures. We show that transformers have lower sensitivity than MLPs, CNNs, ConvMixers and LSTMs, across both vision and language tasks. We also show that this low-sensitivity bias has important implications: i) lower sensitivity correlates with improved robustness; it can also be used as an efficient intervention to further improve the robustness of transformers; ii) it corresponds to flatter minima in the loss landscape; and iii) it can serve as a progress measure for grokking. We support these findings with theoretical results showing (weak) spectral bias of transformers in the NTK regime, and improved robustness due to the lower sensitivity. The code is available at https://github.com/estija/sensitivity.
Problem

Research questions and friction points this paper is trying to address.

Transformers' low sensitivity to input perturbations
Inductive bias differences in neural architectures
Low sensitivity improves model robustness and grokking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformers exhibit low sensitivity
Low sensitivity enhances model robustness
Spectral bias supports transformer efficiency
Bhavya Vasudeva
Bhavya Vasudeva
University of Southern California
Machine LearningOptimizationRobustness
Deqing Fu
Deqing Fu
USC
Machine LearningTheoryNatural Language Processing
T
Tianyi Zhou
University of Southern California
E
Elliott Kau
University of Southern California
Y
Youqi Huang
University of Southern California
Vatsal Sharan
Vatsal Sharan
Stanford University