Flash-SD-KDE: Accelerating SD-KDE with Tensor Cores

📅 2026-02-10
🤖 AI Summary
This work addresses the computational cost of score-debiased kernel density estimation (SD-KDE), which converges faster asymptotically than classical KDE but has been significantly slower in practice because it relies on an empirical score. The authors re-order the SD-KDE computation to expose its matrix-multiplication structure, so that GPU Tensor Cores can accelerate it. On a 32k-sample, 16-dimensional problem, the approach runs up to 47× faster than a strong SD-KDE GPU baseline and 3,300× faster than scikit-learn's KDE. On a larger 1M-sample, 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in 2.3 seconds on a single GPU, making score-debiased density estimation practical at previously infeasible scales.

📝 Abstract
Score-debiased kernel density estimation (SD-KDE) achieves improved asymptotic convergence rates over classical KDE, but its use of an empirical score has made it significantly slower in practice. We show that by re-ordering the SD-KDE computation to expose matrix-multiplication structure, Tensor Cores can be used to accelerate the GPU implementation. On a 32k-sample 16-dimensional problem, our approach runs up to $47\times$ faster than a strong SD-KDE GPU baseline and $3{,}300\times$ faster than scikit-learn's KDE. On a larger 1M-sample 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in $2.3$ s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.
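
To illustrate what "exposing matrix-multiplication structure" can mean here, the following is a minimal NumPy sketch, not the paper's GPU implementation. It assumes a Gaussian kernel, uses the standard identity ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q·x so that all pairwise distances come from one matrix product, and computes the empirical score of the KDE with a second matrix product. The function names (`kernel_matrix`, `kde_density`, `kde_score`, `sd_kde_density`) are hypothetical, and the default shift of h^2/2 along the score with an unchanged bandwidth is an assumption made for illustration; the paper's exact step size and bandwidth may differ.

```python
import numpy as np

def kernel_matrix(a, b, h):
    """Gaussian kernel weights K[i, j] = exp(-||a_i - b_j||^2 / (2 h^2))."""
    # Pairwise squared distances via a single matrix product (GEMM).
    a2 = np.sum(a**2, axis=1)[:, None]
    b2 = np.sum(b**2, axis=1)[None, :]
    sq = np.maximum(a2 + b2 - 2.0 * (a @ b.T), 0.0)  # clamp rounding noise
    return np.exp(-sq / (2.0 * h**2))

def kde_density(queries, data, h):
    """Classical Gaussian KDE evaluated at each query point."""
    n, d = data.shape
    K = kernel_matrix(queries, data, h)
    return K.sum(axis=1) / (n * (2.0 * np.pi * h**2) ** (d / 2))

def kde_score(points, data, h):
    """Gradient of log KDE: sum_j K_ij (x_j - p_i) / (h^2 sum_j K_ij)."""
    K = kernel_matrix(points, data, h)
    s = K.sum(axis=1, keepdims=True)
    # K @ data is the second GEMM; no (m, n, d) difference tensor is built.
    return ((K @ data) / s - points) / h**2

def sd_kde_density(queries, data, h, step=None):
    """Shift samples along the empirical score, then run KDE on the result."""
    if step is None:
        step = 0.5 * h**2  # ASSUMPTION: illustrative step size, not confirmed
    shifted = data + step * kde_score(data, data, h)
    return kde_density(queries, shifted, h)
```

Both heavy steps reduce to dense products of an (m, d) matrix with a (d, n) matrix or an (m, n) matrix with an (n, d) matrix, which is the kind of workload Tensor Cores are built for. A call such as `sd_kde_density(queries, data, h=0.5)` returns one density value per query row.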
Problem

Research questions and friction points this paper is trying to address.

Score-debiased KDE
Kernel Density Estimation
Computational Efficiency
Large-scale Density Estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensor Cores
Score-debiased KDE
Kernel Density Estimation
GPU acceleration
Matrix multiplication
Elliot L. Epstein
PhD student, Stanford University
Deep learning, machine learning
Rajat Vadiraj Dwaraknath
Institute for Computational and Mathematical Engineering, Stanford University
John Winnicki
Institute for Computational and Mathematical Engineering, Stanford University