Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

📅 2025-03-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Kolmogorov–Arnold Networks (KANs) lack empirical validation in vision tasks, particularly within attention mechanisms. Method: This work introduces a general learnable KAN-based attention module for vanilla Vision Transformers (ViTs), termed Kolmogorov–Arnold Attention (KArAt), which can operate on any choice of basis. Because the general design is costly to train in compute and memory, the authors also propose a more modular Fourier-basis variant, Fourier-KArAt. They analyze loss landscapes, weight distributions, optimizer paths, attention visualizations, and spectral behavior, and contrast them with vanilla ViTs. Contribution/Results: Fourier-KArAt and its variants either outperform their ViT counterparts or perform comparably on CIFAR-10, CIFAR-100, and ImageNet-1K. The stated goal is not parameter- or compute-efficient attention, but to encourage further study of KANs alongside more advanced architectures.

📝 Abstract

Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of learnable activation functions with the potential to capture more complex relationships from data. Although KANs are useful in finding symbolic representations and continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the computing and memory costs of training them motivated us to propose a more modular version, and we designed a particular learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available at: https://subhajitmaity.me/KArAt
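
The abstract does not spell out how the learnable attention is computed, so the following is only an illustrative sketch, not the authors' implementation. It assumes that the softmax over one row of scaled dot-product scores is replaced by a KAN-style layer whose univariate functions are truncated Fourier series. The function names `fourier_kan_layer` and `learnable_attention_row`, the coefficient shapes, and the grid size `G` are all hypothetical choices made for this example.

```python
import numpy as np

def fourier_kan_layer(x, a, b):
    """KAN-style layer with Fourier-basis univariate functions (assumed form).

    x: (n_in,) input vector, here one row of attention scores.
    a, b: (n_out, n_in, G) cosine and sine coefficients (the learnable part).
    Returns y of shape (n_out,), with
        y[i] = sum_j sum_k a[i,j,k]*cos(k*x[j]) + b[i,j,k]*sin(k*x[j]),  k = 1..G.
    """
    G = a.shape[-1]
    k = np.arange(1, G + 1)            # frequencies 1..G
    angles = x[:, None] * k[None, :]   # (n_in, G)
    cos_part = np.einsum("ijk,jk->i", a, np.cos(angles))
    sin_part = np.einsum("ijk,jk->i", b, np.sin(angles))
    return cos_part + sin_part

def learnable_attention_row(q, K, a, b):
    """Score one query against all keys, then apply the Fourier layer
    where standard attention would apply softmax (assumption)."""
    scores = K @ q / np.sqrt(q.shape[0])    # (N,) scaled dot-product scores
    return fourier_kan_layer(scores, a, b)  # (N,) learned, unnormalized weights

rng = np.random.default_rng(0)
N, d, G = 4, 8, 3                      # tokens, head dimension, grid size (illustrative)
q = rng.normal(size=d)
K = rng.normal(size=(N, d))
a = 0.1 * rng.normal(size=(N, N, G))   # randomly initialized; trained in practice
b = 0.1 * rng.normal(size=(N, N, G))
weights = learnable_attention_row(q, K, a, b)
```

In this sketch the coefficient tensors hold 2·N·N·G values per head, which grows quadratically with the number of tokens. That gives a rough sense of why training a general learnable attention can be costly in compute and memory, as the abstract notes. Unlike softmax, the output here is not normalized to sum to one.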

Problem

Research questions and friction points this paper is trying to address.

The effectiveness of Kolmogorov-Arnold networks in vision tasks remains unclear.

KANs have so far been used in ViTs only as MLP replacements, not as learnable attention.

A general learnable Kolmogorov-Arnold Attention is costly to train in compute and memory.

Innovation

Methods, ideas, or system contributions that make the work stand out.

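Introduces a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla vision Transformers that can operate on any choice of basis.

Proposes the more modular Fourier-KArAt to address the compute and memory costs of training the general design.

Analyzes performance and generalization through loss landscapes, weight distributions, optimizer paths, attention visualization, and spectral behavior.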
