🤖 AI Summary
This work addresses the high communication cost of serving training data to many heterogeneous clients, for which transmitting a single pre-trained model is often infeasible. To this end, the authors propose PLADA, a novel approach that, for the first time, enables task knowledge transfer without transmitting any pixel-level data. Leveraging a preloaded, generic, unlabeled reference dataset, PLADA transmits only pseudo-labels for a small, task-relevant subset of the reference images and combines semantic-relevance pruning, dataset distillation, and local fine-tuning to convey task-specific knowledge efficiently. Extensive experiments on ten diverse datasets demonstrate that PLADA achieves high classification accuracy with less than 1 MB of transmitted label data, drastically reducing communication overhead while maintaining training efficiency.
📝 Abstract
A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, clients require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small payloads. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume clients are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K), and we communicate a new task by transmitting only class labels for specific reference images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset, retaining labels only for the images most semantically relevant to the target task. This selection simultaneously maximizes training efficiency and minimizes the transmission payload. Experiments on 10 diverse datasets demonstrate that our approach transfers task knowledge with a payload of less than 1 MB while retaining high classification accuracy, offering a promising solution for efficient dataset serving.
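To make the label-only transfer concrete, the sketch below shows one plausible way such a payload could be built: embed the preloaded reference images and a handful of target-task exemplars with a shared encoder, keep only the reference images closest to each target-class prototype, and transmit compressed (index, label) pairs. The function names, the prototype-similarity criterion, and the per-class top-k rule are illustrative assumptions, not the paper's exact pruning procedure.

```python
import json
import zlib

import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise L2 normalization so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def build_label_payload(
    ref_embeddings: np.ndarray,      # (N_ref, D) embeddings of the preloaded reference images
    target_embeddings: np.ndarray,   # (N_tgt, D) embeddings of a few target-task exemplars
    target_labels: np.ndarray,       # (N_tgt,) integer class ids of those exemplars
    per_class: int = 200,            # how many reference images to pseudo-label per class (assumed knob)
) -> bytes:
    """Return a compressed payload of (reference_index, pseudo_label) pairs.

    Reference images most similar to each target-class prototype are kept;
    everything else is pruned, so only small integer pairs are transmitted.
    """
    ref = l2_normalize(ref_embeddings)
    tgt = l2_normalize(target_embeddings)

    pairs = []
    for cls in np.unique(target_labels):
        prototype = l2_normalize(tgt[target_labels == cls].mean(axis=0, keepdims=True))
        scores = ref @ prototype.T                      # cosine similarity to the class prototype
        top = np.argsort(-scores[:, 0])[:per_class]     # keep the most semantically relevant images
        pairs.extend((int(i), int(cls)) for i in top)

    return zlib.compress(json.dumps(pairs).encode("utf-8"))


# Toy usage with random vectors standing in for a real encoder's features.
rng = np.random.default_rng(0)
ref_emb = rng.normal(size=(10_000, 512))
tgt_emb = rng.normal(size=(50, 512))
tgt_lab = rng.integers(0, 5, size=50)

payload = build_label_payload(ref_emb, tgt_emb, tgt_lab)
print(f"payload size: {len(payload) / 1024:.1f} KiB")   # tiny compared to raw pixels
```

Because only integer index-label pairs are sent, the payload scales with the number of selected labels rather than with image resolution, which is what keeps it in the kilobyte-to-megabyte range.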