Coreset selection for the Sinkhorn divergence and generic smooth divergences

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the efficient construction of convexly weighted coresets for the Sinkhorn divergence and other generic smooth divergences. We propose CO2, a novel algorithm that reformulates coreset selection as a maximum mean discrepancy (MMD) minimization problem via a second-order functional Taylor expansion of the target objective. CO2 matches the approximation guarantees of random sampling for the full-dataset Sinkhorn divergence while using only logarithmically many samples. Theoretically, we establish formal connections between coreset selection, kernel quadrature, moment matching, and score matching; verify new regularity properties induced by entropic regularization in optimal transport; and design a log-sample-complexity coreset sampling scheme for the Sinkhorn divergence with rigorous theoretical guarantees. Empirical evaluation on image subsampling tasks validates its effectiveness.
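The summary's core reduction, coreset selection as MMD minimization, can be illustrated with a generic greedy (kernel-herding style) selector. This is a minimal sketch, not the authors' CO2 algorithm: CO2 additionally optimizes convex weights and derives its kernel from a second-order Taylor expansion of the target divergence, whereas the RBF kernel, uniform coreset weights, and the `gamma` bandwidth below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def greedy_mmd_coreset(X, m, gamma=1.0):
    """Greedily pick m indices whose uniform empirical measure tracks the
    kernel mean embedding of the full dataset (kernel herding)."""
    K = rbf_kernel(X, X, gamma)       # n x n Gram matrix
    mean_embed = K.mean(axis=1)       # (1/n) * sum_j k(x_i, x_j) for each i
    herd = np.zeros(len(X))           # running sum of k(x_i, x_s) over selected s
    selected = []
    for t in range(m):
        # Herding score: attraction toward the data distribution
        # minus repulsion from the points already in the coreset.
        scores = mean_embed - herd / (t + 1)
        scores[selected] = -np.inf    # select without replacement
        j = int(np.argmax(scores))
        selected.append(j)
        herd += K[:, j]
    return np.array(selected)

# Toy usage: pick 20 representatives from 1000 Gaussian samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
coreset = X[greedy_mmd_coreset(X, m=20, gamma=0.5)]
```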

📝 Abstract
We introduce CO2, an efficient algorithm to produce convexly-weighted coresets with respect to generic smooth divergences. By employing a functional Taylor expansion, we show a local equivalence between sufficiently regular losses and their second-order approximations, reducing the coreset selection problem to maximum mean discrepancy minimization. We apply CO2 to the Sinkhorn divergence, providing a novel sampling procedure that requires logarithmically many data points to match the approximation guarantees of random sampling. To show this, we additionally verify several new regularity properties for entropically regularized optimal transport of independent interest. Our approach leads to a new perspective linking coreset selection and kernel quadrature to classical statistical methods such as moment and score matching. We showcase this method with a practical application of subsampling image data, and highlight key directions to explore for improved algorithmic efficiency and theoretical guarantees.
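For reference, the Sinkhorn divergence targeted by the abstract is the debiased entropic optimal transport cost, S_eps(a, b) = OT_eps(a, b) - (OT_eps(a, a) + OT_eps(b, b)) / 2. Below is a minimal log-domain sketch of this quantity for weighted point clouds; the squared-Euclidean ground cost, fixed iteration count, and eps values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.special import logsumexp

def ot_eps(x, y, a, b, eps=0.1, iters=500):
    """Entropically regularized OT cost OT_eps(a, b) between weighted
    point clouds, via log-domain Sinkhorn iterations on the dual potentials."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared-Euclidean cost
    loga, logb = np.log(a), np.log(b)
    f, g = np.zeros(len(x)), np.zeros(len(y))
    for _ in range(iters):
        # Softmin updates: f_i = -eps * log sum_j b_j exp((g_j - C_ij) / eps)
        f = -eps * logsumexp((g[None, :] - C) / eps + logb[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + loga[:, None], axis=0)
    return a @ f + b @ g  # dual objective, exact at convergence

def sinkhorn_divergence(x, y, a, b, eps=0.1):
    """Debiased Sinkhorn divergence S_eps(a, b)."""
    return (ot_eps(x, y, a, b, eps)
            - 0.5 * ot_eps(x, x, a, a, eps)
            - 0.5 * ot_eps(y, y, b, b, eps))

# Toy usage: divergence between a full dataset and a small subsample.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))
y = x[rng.choice(200, size=20, replace=False)]
a, b = np.full(200, 1 / 200), np.full(20, 1 / 20)
print(sinkhorn_divergence(x, y, a, b, eps=0.5))
```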
Problem

Research questions and friction points this paper is trying to address.

Efficient coreset selection for smooth divergences
Novel sampling for Sinkhorn divergence approximation
Linking coreset selection to kernel quadrature methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Functional Taylor expansion for coreset selection
Maximum mean discrepancy minimization approach
Logarithmic sample complexity for Sinkhorn divergence approximation
Alex Kokot
Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA
Alex Luedtke
Member of the Faculty, Harvard University
causal inference, machine learning, semiparametrics, automated estimation