Hyperbolic Dataset Distillation

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dataset distillation methods struggle to model hierarchical structures and complex geometric relationships in large-scale datasets due to the limitations of Euclidean space. To address this, the paper introduces hyperbolic geometry (specifically the Lorentz model) to dataset distillation for the first time. The method defines a geodesic-distance-driven distribution matching mechanism in negatively curved space and explicitly aligns the hyperbolic centroids of the synthetic and original datasets, enabling hierarchy-aware compact data synthesis. Empirically, it retains or even improves model performance and training stability using only 20% of the distilled core set. The approach consistently outperforms state-of-the-art methods across multiple benchmark datasets, integrates seamlessly with mainstream distribution matching frameworks, and establishes a novel paradigm for efficient, scalable dataset compression.

📝 Abstract
To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. Notably, HDD is seamlessly compatible with most existing DM methods, and extensive experiments on different datasets validate its effectiveness.
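The core mechanism described in the abstract (lift features onto the Lorentz hyperboloid, then match the hyperbolic centroids of synthetic and original data by geodesic distance) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: curvature is fixed at -1, features are lifted via the exponential map at the origin, and the centroid uses the normalized-sum (Lorentzian centroid) construction, which is an assumption about the exact aggregation used.

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product: <x, y>_L = -x0*y0 + <x_1:, y_1:>
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def exp_map_origin(v):
    # Lift a Euclidean feature v in R^d onto the unit hyperboloid in R^{d+1}
    # via the exponential map at the origin (1, 0, ..., 0).
    norm = np.clip(np.linalg.norm(v, axis=-1, keepdims=True), 1e-9, None)
    return np.concatenate([np.cosh(norm), np.sinh(norm) * v / norm], axis=-1)

def geodesic_distance(x, y):
    # Hyperbolic distance on the hyperboloid: d(x, y) = arccosh(-<x, y>_L).
    # Clipping guards against arccosh arguments dipping below 1 numerically.
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

def lorentz_centroid(X):
    # Lorentzian centroid as the Lorentz-normalized sum of hyperboloid
    # points; satisfies <mu, mu>_L = -1 by construction.
    s = X.sum(axis=0)
    return s / np.sqrt(np.clip(-lorentz_inner(s, s), 1e-9, None))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = exp_map_origin(rng.normal(size=(32, 4)) * 0.1)  # original features
    syn = exp_map_origin(rng.normal(size=(8, 4)) * 0.1)    # synthetic features
    # Distribution-matching objective: geodesic distance between centroids.
    loss = geodesic_distance(lorentz_centroid(real), lorentz_centroid(syn))
```

In practice the synthetic features would be optimized to minimize this centroid distance, per class, with gradients flowing through the lift.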
Problem

Research questions and friction points this paper is trying to address.

Addresses computational challenges in large-scale deep learning datasets
Overcomes Euclidean space limitations in dataset distillation
Integrates hierarchical structures into synthetic data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hyperbolic space embeds hierarchical data structures
Lorentz model measures centroid geodesic distance
Pruning retains performance with 20% core set
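The pruning idea in the last bullet can be sketched as follows. The selection criterion is an assumption (neither the summary nor the abstract specifies the ranking rule): here, distilled samples are ranked by geodesic distance to the hyperbolic centroid and the closest 20% are kept, reflecting the abstract's observation that synthetic samples gravitate toward root-centric regions.

```python
import numpy as np

def hyperbolic_distance(x, y):
    # Geodesic distance on the unit hyperboloid (Lorentz model):
    # d(x, y) = arccosh(-<x, y>_L), with <x, y>_L = -x0*y0 + <x_1:, y_1:>.
    inner = -(-x[0] * y[0] + np.dot(x[1:], y[1:]))
    return np.arccosh(max(inner, 1.0))

def prune_by_centroid_distance(X, centroid, keep_ratio=0.2):
    # Keep the keep_ratio fraction of hyperboloid points closest to the
    # hyperbolic centroid. The distance-to-centroid ranking is a
    # hypothetical criterion; the paper only reports that 20% of the
    # distilled core set preserves performance.
    d = np.array([hyperbolic_distance(x, centroid) for x in X])
    k = max(1, int(round(len(X) * keep_ratio)))
    keep_idx = np.argsort(d)[:k]
    return X[keep_idx], keep_idx
```

On a toy 1-D hyperboloid, points at radii 0.1 and 0.3 from the origin survive a 40% prune of {0.5, 0.1, 0.9, 0.3, 0.7}, as expected.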