🤖 AI Summary
This work addresses the computational inefficiency of score-debiased kernel density estimation (SD-KDE), which, despite its superior asymptotic convergence, scales poorly in practice because of its reliance on an empirical score. To overcome this limitation, the authors reformulate SD-KDE to expose its inherent matrix-multiplication structure, yielding the first implementation that exploits GPU Tensor Cores for highly parallelized computation. The approach delivers dramatic speedups without compromising estimation accuracy: it is 47× faster than the strongest existing GPU baseline and 3,300× faster than scikit-learn on a 32k-sample, 16-dimensional task. On a 1M-sample, 16-dimensional problem, a single GPU completes 131k queries in just 2.3 seconds, making high-dimensional, large-scale score-debiased density estimation practical at previously infeasible scales.
📝 Abstract
Score-debiased kernel density estimation (SD-KDE) achieves improved asymptotic convergence rates over classical KDE, but its use of an empirical score makes it significantly slower in practice. We show that re-ordering the SD-KDE computation to expose matrix-multiplication structure allows the GPU implementation to exploit Tensor Cores. On a 32k-sample, 16-dimensional problem, the resulting method, Flash-SD-KDE, runs up to $47\times$ faster than a strong SD-KDE GPU baseline and $3{,}300\times$ faster than scikit-learn's KDE. On a larger 1M-sample, 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in $2.3$ s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.
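The key idea of exposing matrix-multiplication structure can be illustrated with a minimal NumPy sketch: expanding $\|q - x\|^2 = \|q\|^2 + \|x\|^2 - 2\,q\cdot x$ turns the dominant cost of Gaussian-kernel evaluation into a single GEMM, exactly the operation Tensor Cores accelerate. This is an illustrative reconstruction under our own assumptions, not the paper's implementation; in particular, the function names are hypothetical and SD-KDE's score-based debiasing shift of the samples is omitted.

```python
import numpy as np

def pairwise_sq_dists_matmul(X, Q):
    """Pairwise squared distances between query points Q (m, d) and
    samples X (n, d) via ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x.
    The dominant cost is the single matrix multiply Q @ X.T, which
    maps directly onto GEMM hardware such as Tensor Cores."""
    x_sq = np.sum(X * X, axis=1)                   # (n,)
    q_sq = np.sum(Q * Q, axis=1)                   # (m,)
    cross = Q @ X.T                                # (m, n) -- the GEMM
    d2 = q_sq[:, None] + x_sq[None, :] - 2.0 * cross
    return np.maximum(d2, 0.0)                     # clamp rounding negatives

def gaussian_kde_matmul(X, Q, h):
    """Classical Gaussian KDE at queries Q using matmul-based distances.
    (Sketch only: SD-KDE would additionally shift the samples X by a
    score-based debiasing term before this step, which is omitted here.)"""
    n, d = X.shape
    d2 = pairwise_sq_dists_matmul(X, Q)
    norm = (2.0 * np.pi * h * h) ** (d / 2.0)      # Gaussian normalizer
    return np.exp(-d2 / (2.0 * h * h)).sum(axis=1) / (n * norm)
```

On a GPU, the `Q @ X.T` product would be executed in low precision on Tensor Cores, with the rank-one correction terms added afterward; the re-ordering, not the kernel itself, is what unlocks the hardware speedup.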