Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing distribution compression methods compress only the number of samples, making them inadequate for data that is both large-scale and high-dimensional. To address this, we propose a bilateral distribution compression framework that simultaneously reduces the sample count and the data dimensionality while preserving the underlying distribution. The method introduces the Decoded Maximum Mean Discrepancy (DMMD) to quantify compression quality, and employs a two-stage optimisation: first, learning a low-dimensional projection via the Reconstruction MMD (RMMD); second, optimising a compact latent-space compressed set using the Encoded MMD (EMMD). The framework runs in time and memory linear in dataset size and dimension, substantially reducing computational overhead while matching or surpassing the compression quality of state-of-the-art ambient-space methods. This establishes a practical paradigm for efficient distribution approximation of large-scale, high-dimensional data.

📝 Abstract
Existing distribution compression methods reduce dataset size by minimising the Maximum Mean Discrepancy (MMD) between original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall linear time and memory complexity in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which quantifies the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that across a variety of scenarios BDC can achieve comparable or superior performance to ambient-space compression at substantially lower cost.
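The MMD at the heart of the abstract compares two sample sets through a kernel. As a rough illustration only (the function and variable names below are ours, not the paper's code), a minimal biased MMD² estimate with a Gaussian kernel looks like this:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.1):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=0.1):
    """Biased estimate of the squared MMD between sample sets X and Y."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                        # "original" dataset
subset = X[rng.choice(500, size=50, replace=False)]  # a naive compressed set
shifted = X + 2.0                                    # a mismatched distribution
```

A subset drawn from the same distribution scores a far smaller MMD against `X` than the shifted set does, which is exactly the property compression methods exploit when they minimise MMD between the original and compressed sets.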
Problem

Research questions and friction points this paper is trying to address.

Compressing datasets that are large in both sample size and dimensionality
Reducing sample count and data dimensionality simultaneously
Preserving the underlying data distribution throughout compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework compresses both sample count and dimensionality
Uses the Decoded MMD to quantify distribution discrepancy
Learns a low-dimensional projection, then optimises a latent compressed set
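The two-stage idea above can be sketched with loud stand-ins: here plain PCA replaces the RMMD-learned projection, and greedy subset selection over a precomputed kernel replaces the paper's EMMD optimisation. All names are illustrative, not the authors' implementation:

```python
import numpy as np

def rbf(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20)) @ rng.normal(size=(20, 20))  # correlated data

# Stage 1 (stand-in): learn a linear projection by PCA instead of RMMD.
mean = X.mean(0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:3].T                  # ambient dim 20 -> latent dim 3
Z = Xc @ P                    # encoded (projected) data

# Stage 2 (stand-in): greedily grow a latent compressed set that minimises
# MMD to the encoded data, instead of the paper's EMMD descent.
gamma = 1.0 / Z.shape[1]
K = rbf(Z, Z, gamma)          # precompute once; MMD terms are submatrix means
kZZ = K.mean()

def latent_mmd2(idx):
    """Biased squared MMD between all of Z and the latent points Z[idx]."""
    sub = np.asarray(idx)
    return kZZ - 2 * K[:, sub].mean() + K[np.ix_(sub, sub)].mean()

selected = []
candidates = list(range(len(Z)))
for _ in range(10):           # compressed set of 10 latent points
    best = min(candidates, key=lambda i: latent_mmd2(selected + [i]))
    selected.append(best)
    candidates.remove(best)

Z_comp = Z[selected]          # latent compressed set   (10 x 3)
X_comp = Z_comp @ P.T + mean  # decoded to ambient space (10 x 20)
```

The sketch mirrors the bilateral structure (fewer points *and* fewer dimensions, with a decode step back to ambient space) but not the paper's objectives or its linear-complexity guarantees.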
Dominic Broadbent
School of Mathematics, University of Bristol, Bristol, United Kingdom
Nick Whiteley
University of Bristol
Topological and Geometric Data Analysis · Networks and Graphs · Uncertainty Dynamics
Robert Allison
School of Mathematics, University of Bristol, Bristol, United Kingdom
Tom Lovett
Mathematical Institute, University of Oxford, Oxford, United Kingdom