An Efficient Permutation-Based Kernel Two-Sample Test

📅 2025-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Nonparametric two-sample testing under large-scale data—efficiently determining whether two samples are drawn from the same distribution while preserving statistical reliability—remains computationally challenging. Method: We propose a fast permutation-based Maximum Mean Discrepancy (MMD) test leveraging Nyström kernel approximation. Contribution/Results: This is the first method to theoretically guarantee, under mild conditions, finite-sample statistical power matching the minimax optimal separation rate—while maintaining test validity. By approximating the kernel matrix at rank $m$ ($m \ll n$), the computational complexity of the MMD statistic is reduced from $O(n^2)$ to $O(nm)$, enabling scalability without sacrificing the model-free robustness of permutation testing. Empirical evaluation on real scientific datasets demonstrates that the proposed test achieves accuracy approaching the theoretical optimum, alongside substantial speedups—up to orders of magnitude—over exact MMD permutation tests.
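As a sketch of the core idea (not the authors' code), a rank-$m$ Nyström feature map lets the squared MMD be computed in $O(nm)$ time instead of $O(n^2)$: project both samples onto features built from $m$ landmark points and compare feature means. The function names, the Gaussian kernel choice, and the jitter constant below are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix between rows of A and rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def nystrom_mmd2(X, Y, m=50, sigma=1.0, rng=None):
    # Approximate squared MMD via rank-m Nystrom features: O(nm) kernel
    # evaluations per sample instead of the O(n^2) of the exact statistic.
    rng = np.random.default_rng(rng)
    Z = np.vstack([X, Y])
    idx = rng.choice(len(Z), size=m, replace=False)  # landmark points
    L = Z[idx]
    K_mm = gaussian_kernel(L, L, sigma)
    # Symmetric inverse square root of K_mm (jitter added for stability).
    w, V = np.linalg.eigh(K_mm + 1e-9 * np.eye(m))
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    phi_X = gaussian_kernel(X, L, sigma) @ inv_sqrt  # (n, m) feature matrix
    phi_Y = gaussian_kernel(Y, L, sigma) @ inv_sqrt
    diff = phi_X.mean(axis=0) - phi_Y.mean(axis=0)
    return float(diff @ diff)
```

Identical samples give a statistic of zero, while well-separated distributions give a large value; the landmarks are drawn from the pooled sample so the feature map treats both groups symmetrically.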

📝 Abstract
Two-sample hypothesis testing, i.e., determining whether two sets of data are drawn from the same distribution, is a fundamental problem in statistics and machine learning with broad scientific applications. In the context of nonparametric testing, maximum mean discrepancy (MMD) has gained popularity as a test statistic due to its flexibility and strong theoretical foundations. However, its use in large-scale scenarios is plagued by high computational costs. In this work, we use a Nyström approximation of the MMD to design a computationally efficient and practical testing algorithm while preserving statistical guarantees. Our main result is a finite-sample bound on the power of the proposed test for distributions that are sufficiently separated with respect to the MMD. The derived separation rate matches the known minimax optimal rate in this setting. We support our findings with a series of numerical experiments, emphasizing realistic scientific data.
Problem

Research questions and friction points this paper is trying to address.

Efficient two-sample hypothesis testing
Reducing MMD computational costs
Preserving statistical guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nyström approximation
Efficient testing algorithm
Finite-sample power bound
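The permutation calibration the paper builds on is generic: under the null, the pooled sample is exchangeable, so recomputing any statistic on label-shuffled splits yields a valid p-value regardless of the statistic used. A minimal sketch, where the helper names and the mean-difference statistic are illustrative placeholders (the paper's statistic is the Nyström-approximated MMD):

```python
import numpy as np

def permutation_pvalue(X, Y, stat, B=200, rng=None):
    # Valid p-value for any two-sample statistic: under H0 the pooled
    # sample is exchangeable, so shuffled splits mimic the null law.
    rng = np.random.default_rng(rng)
    Z = np.vstack([X, Y])
    n = len(X)
    t_obs = stat(X, Y)
    exceed = 0
    for _ in range(B):
        perm = rng.permutation(len(Z))
        exceed += stat(Z[perm[:n]], Z[perm[n:]]) >= t_obs
    # The "+1" terms make the test exactly level-alpha in finite samples.
    return (1 + exceed) / (1 + B)

# Illustrative statistic: absolute difference of sample means.
mean_diff = lambda X, Y: abs(X.mean() - Y.mean())
```

Because validity comes from exchangeability alone, swapping in the fast Nyström MMD statistic keeps the test model-free while reducing the per-permutation cost.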
Authors

Antoine Chatalic
MaLGa Center - DIBRIS, Università di Genova, Genoa, Italy; CNRS, Univ. Grenoble-Alpes, GIPSA-lab, France

Marco Letizia
MaLGa Center - DIBRIS, Università di Genova, Genoa, Italy; INFN - Sezione di Genova, Genoa, Italy

Nicolas Schreuder
CNRS, LIGM (Statistics, Machine Learning)

Lorenzo Rosasco
MaLGa Machine Learning Genoa Center - Università degli Studi di Genova (Learning Theory, Machine Learning)