🤖 AI Summary
This paper addresses the high computational cost of kernel methods in supervised learning by introducing the first adaptation of Kernel Thinning (KT), previously used to accelerate unsupervised tasks such as Monte Carlo integration, to supervised settings, yielding two novel estimators: KT-NW (Kernel-Thinned Nadaraya-Watson regression) and KT-KRR (Kernel-Thinned Kernel Ridge Regression). Methodologically, it designs regression-aware kernels so that KT's distribution compression can be applied to labeled data, preserving the statistical structure needed for prediction while drastically reducing the dataset size. Theoretically, it establishes a novel multiplicative error guarantee for KT compression, which jointly yields statistical-accuracy and computational-efficiency guarantees for the resulting estimators. Empirically, the proposed methods achieve quadratic speedups in both training and inference on synthetic and real-world benchmarks; they significantly outperform i.i.d. subsampling in statistical error and closely approach the performance of full-data models.
📝 Abstract
The kernel thinning (KT) algorithm of Dwivedi & Mackey (2024) provides a better-than-i.i.d. compression of a generic set of points. By generating high-fidelity coresets of size significantly smaller than the input set, KT is known to speed up unsupervised tasks like Monte Carlo integration, uncertainty quantification, and non-parametric hypothesis testing, with minimal loss in statistical accuracy. In this work, we generalize the KT algorithm to speed up supervised learning problems involving kernel methods. Specifically, we combine two classical algorithms, Nadaraya-Watson (NW) regression (also known as kernel smoothing) and kernel ridge regression (KRR), with KT to provide a quadratic speed-up in both training and inference times. We show how distribution compression with KT in each setting reduces to constructing an appropriate kernel, and introduce the Kernel-Thinned NW and Kernel-Thinned KRR estimators. We prove that KT-based regression estimators enjoy significantly superior computational efficiency over the full-data estimators and improved statistical efficiency over i.i.d. subsampling of the training data. En route, we also provide a novel multiplicative error guarantee for compressing with KT. We validate our design choices with both simulations and real data experiments.
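To make the speed-up mechanism concrete, the sketch below shows Nadaraya-Watson regression evaluated on the full training set versus on a small coreset. This is an illustrative toy, not the paper's method: the coreset here is a plain uniform subsample of size roughly √n, whereas KT-NW would select the coreset via kernel thinning with a regression-aware kernel, which is what yields the improved statistical error over i.i.d. subsampling. The kernel choice, bandwidth, and data are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=0.5):
    # Gaussian (RBF) kernel matrix between two 1-D point arrays.
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2 * bandwidth**2))

def nw_predict(x_query, x_train, y_train, bandwidth=0.5):
    # Nadaraya-Watson estimate: kernel-weighted average of training labels.
    w = gaussian_kernel(x_query, x_train, bandwidth)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)

# Stand-in for a KT coreset: a uniform subsample of size ~sqrt(n).
# KT-NW would instead construct this coreset by kernel thinning.
m = int(np.sqrt(n))
idx = rng.choice(n, size=m, replace=False)

x_query = np.linspace(-3, 3, 5)
full = nw_predict(x_query, x, y)                   # cost O(n) per query
compressed = nw_predict(x_query, x[idx], y[idx])   # cost O(sqrt(n)) per query
```

Each prediction on the coreset touches only m ≈ √n points instead of n, which is the source of the quadratic reduction in inference cost; the paper's contribution is showing that a KT-selected coreset keeps the statistical error close to the full-data estimator.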