🤖 AI Summary
This work integrates kernel methods into deep learning architectures by introducing “Sparse Kernels” (SKs), a differentiable, localized, and lazy variant of kernel ridge regression that defers training to inference time and reduces to solving small local systems. SKs are implemented as modular, end-to-end trainable PyTorch layers that expose three sets of parameters (feature representations, target values, and evaluation points), each of which can be fixed or learned. This decomposition broadens the design space, enabling training-free transfer, nonlinear probing, and hybrid kernel-neural models. Across convolutional networks, Vision Transformers, and reinforcement learning, SK modules play two roles: in some settings they match the performance of trained neural readouts with substantially less training, and in others they improve existing models when added as extra components.
📝 Abstract
Deep neural networks dominate modern machine learning, while alternative function approximators remain comparatively underexplored at scale. In this work, we revisit kernel methods as drop-in components for standard deep learning pipelines. We introduce Sparse Kernels (SKs), a differentiable, localized, and lazy variant of kernel ridge regression (KRR) that defers training to inference time and reduces to the solution of small local systems. We integrate SKs into PyTorch as modular layers that preserve end-to-end trainability, and we show that they expose three distinct sets of parameters (feature representations, target values, and evaluation points), each of which can be fixed or learned. This decomposition broadens the design space available to practitioners, enabling, in particular, training-free transfer, nonlinear probing, and hybrid kernel-neural models. Across convolutional networks, vision transformers, and reinforcement learning, SK-based modules serve two complementary roles: in some settings, they match the performance of trained neural readouts with substantially less training; in others, they augment existing models and improve their performance when used as additional components. Our results suggest that kernel methods, once made scalable and differentiable, can be readily integrated with deep learning rather than treated as a separate paradigm.
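To make the idea of a lazy, localized KRR concrete, the following is a minimal PyTorch sketch. It is not the authors' implementation. The function names (`sq_dists`, `local_krr_predict`), the Gaussian kernel, the k-nearest-neighbor selection rule, and the default hyperparameters are all assumptions chosen for illustration. The mapping of `support_x`, `support_y`, and `queries` onto the abstract's feature representations, target values, and evaluation points is one possible reading, not something the abstract specifies.

```python
import torch


def sq_dists(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise squared Euclidean distances: (..., n, d), (..., m, d) -> (..., n, m)."""
    diff = a.unsqueeze(-2) - b.unsqueeze(-3)
    return diff.pow(2).sum(dim=-1)


def local_krr_predict(queries, support_x, support_y, k=8, ridge=1e-3, lengthscale=1.0):
    """Hypothetical local KRR: one small k x k ridge system per query point.

    queries:   (q, d) points at which to predict
    support_x: (n, d) stored inputs (may require grad)
    support_y: (n, c) stored targets (may require grad)
    """
    # Neighbor selection is discrete, so it is done without gradient tracking.
    with torch.no_grad():
        idx = sq_dists(queries, support_x).topk(k, dim=1, largest=False).indices  # (q, k)

    nbr_x = support_x[idx]  # (q, k, d)
    nbr_y = support_y[idx]  # (q, k, c)

    # Gaussian kernel (an assumed choice) on each local neighborhood.
    scale = 2.0 * lengthscale ** 2
    gram = torch.exp(-sq_dists(nbr_x, nbr_x) / scale)  # (q, k, k)
    eye = torch.eye(k, dtype=gram.dtype, device=gram.device)

    # "Training" happens here, at prediction time: solve (K + ridge * I) alpha = y locally.
    alpha = torch.linalg.solve(gram + ridge * eye, nbr_y)  # (q, k, c)

    # Kernel values between each query and its own neighbors.
    k_star = torch.exp(-sq_dists(queries.unsqueeze(1), nbr_x) / scale)  # (q, 1, k)
    return (k_star @ alpha).squeeze(1)  # (q, c)


# Usage: gradients reach both the stored inputs and the stored targets.
torch.manual_seed(0)
support_x = torch.randn(200, 4, dtype=torch.float64, requires_grad=True)
support_y = torch.randn(200, 3, dtype=torch.float64, requires_grad=True)
queries = torch.randn(5, 4, dtype=torch.float64)

pred = local_krr_predict(queries, support_x, support_y, k=8)
pred.sum().backward()
```

In this sketch nothing is fitted ahead of time: each prediction solves its own small system, so the cost per query depends on `k` rather than on the full support set size once neighbors are found. Because the solve is differentiable, any of the three inputs could in principle be treated as a learnable parameter or held fixed.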