Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional dataset distillation relies on dense pixel representations, resulting in high redundancy and poor scalability. To address this, we propose GSDD—the first framework to introduce 2D sparse Gaussian primitives into dataset distillation—encoding discriminative image information via a small set of learnable Gaussian parameters, thereby significantly improving storage efficiency and training scalability. Our method employs CUDA-accelerated, differentiable splatting rendering, enabling end-to-end optimization. Crucially, it replaces dense pixel-based representation with geometrically sparse Gaussian parameterization, enhancing coverage of hard examples and inter-class diversity. GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while drastically reducing encoding/decoding overhead. The approach is both computationally efficient and broadly applicable across standard vision benchmarks.

📝 Abstract
Dataset distillation has emerged as a promising paradigm that synthesizes compact, informative datasets capable of retaining the knowledge of large-scale counterparts, thereby addressing the substantial computational and storage burdens of modern model training. Conventional approaches typically rely on dense pixel-level representations, which introduce redundancy and are difficult to scale up. In this work, we propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes the critical discriminative information in a distilled image using only a small number of Gaussian primitives. This sparse representation can improve dataset diversity under the same storage budget, enhancing coverage of difficult samples and boosting distillation performance. To ensure both efficiency and scalability, we adapt CUDA-based splatting operators for parallel inference and training, enabling high-quality rendering with minimal computational and memory overhead. Our method is simple yet effective, broadly applicable to different distillation pipelines, and highly scalable. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets while keeping encoding and decoding costs low. Our code is available at https://github.com/j-cyoung/GSDatasetDistillation.
Problem

Research questions and friction points this paper is trying to address.

Proposes sparse Gaussian representation to replace dense pixel-level dataset distillation
Improves dataset diversity and coverage of difficult samples under storage constraints
Enables efficient parallel training with CUDA-based splatting operators for scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse 2D Gaussian primitives for dataset distillation
Employs CUDA splatting operators for parallel training
Encodes discriminative information with minimal Gaussian elements
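To make the core idea concrete, the sketch below renders an image from a handful of anisotropic 2D Gaussians by summing their splatted contributions, so each image costs only nine floats per primitive (2D center, two scales, rotation, RGB color, opacity) instead of a dense pixel grid. This is a minimal NumPy illustration of the general splatting principle, not the paper's CUDA implementation; all function and parameter names here are illustrative assumptions.

```python
# Hypothetical sketch of 2D Gaussian splatting in plain NumPy.
# The paper uses CUDA-accelerated, differentiable splatting; this toy
# version only shows how a sparse set of Gaussians decodes to an image.
import numpy as np

def render_gaussians(xy, scale, theta, color, opacity, H=32, W=32):
    """Render K anisotropic 2D Gaussians into an H x W RGB image.

    xy:      (K, 2) centers in pixel coordinates (x, y)
    scale:   (K, 2) per-axis standard deviations
    theta:   (K,)   rotation angles in radians
    color:   (K, 3) RGB colors in [0, 1]
    opacity: (K,)   per-Gaussian weights
    """
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2)
    img = np.zeros((H, W, 3))
    for k in range(len(xy)):
        c, s = np.cos(theta[k]), np.sin(theta[k])
        R = np.array([[c, -s], [s, c]])
        # Covariance Sigma = R diag(scale^2) R^T; its inverse defines the density.
        cov = R @ np.diag(scale[k] ** 2) @ R.T
        d = grid - xy[k]                                    # (H, W, 2)
        m = np.einsum('hwi,ij,hwj->hw', d, np.linalg.inv(cov), d)
        w = opacity[k] * np.exp(-0.5 * m)                   # Gaussian footprint
        img += w[..., None] * color[k]                      # additive splat
    return np.clip(img, 0.0, 1.0)

# Example: a "distilled image" encoded by just two Gaussians.
img = render_gaussians(
    xy=np.array([[8.0, 8.0], [24.0, 20.0]]),
    scale=np.array([[4.0, 2.0], [3.0, 6.0]]),
    theta=np.array([0.0, np.pi / 4]),
    color=np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
    opacity=np.array([1.0, 0.8]),
)
print(img.shape)  # (32, 32, 3)
```

Because every step is a differentiable function of the Gaussian parameters, gradients from any distillation loss can flow back to the centers, scales, rotations, colors, and opacities, which is what makes end-to-end optimization of this sparse representation possible.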
Chenyang Jiang
Harbin Institute of Technology, Shenzhen
Zhengcen Li
Harbin Institute of Technology, Shenzhen
Hang Zhao
Harbin Institute of Technology, Shenzhen
Qiben Shan
Peng Cheng Laboratory
Shaocong Wu
Peng Cheng Laboratory
Jingyong Su
Professor, Harbin Institute of Technology at Shenzhen, China
Research interests: Computer Vision and Multimodal, Data-Centric ML, Medical Image Analysis, Statistics on Manifold