Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training on large-scale vision-language datasets is prohibitively expensive, and existing compression methods suffer severe performance degradation and limited generalization on extremely small subsets. This work proposes the first training-free framework for multimodal dataset distillation: it uses CLIP to extract aligned image-text embeddings, clusters them into prototypes, and synthesizes images with the unCLIP decoder, bypassing joint optimization of pixel-level and textual features. The approach requires neither full-dataset training nor any specific model architecture, and it substantially outperforms existing distillation and subset-selection methods on highly compressed subsets while achieving state-of-the-art cross-architecture generalization.

📝 Abstract
Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.
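The abstract's pipeline (embed with CLIP, cluster into prototypes, decode with unCLIP) can be illustrated with a minimal sketch of the prototype step. The use of k-means here is an assumption for illustration; the paper only states that aligned embeddings are clustered to obtain prototypes, and the toy random vectors stand in for real CLIP embeddings.

```python
import numpy as np

def compute_prototypes(embeddings, n_prototypes, n_iters=50, seed=0):
    """Cluster L2-normalized CLIP-style embeddings into prototypes.

    Assumption: k-means with cosine (dot-product) assignment; the paper
    does not specify the exact clustering algorithm. Each prototype is
    the renormalized mean of its cluster's embeddings.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen embeddings (unit norm).
    centroids = embeddings[rng.choice(len(embeddings), n_prototypes, replace=False)]
    for _ in range(n_iters):
        # For unit-norm vectors, dot product equals cosine similarity.
        assign = (embeddings @ centroids.T).argmax(axis=1)
        for k in range(n_prototypes):
            members = embeddings[assign == k]
            if len(members):
                c = members.mean(axis=0)
                centroids[k] = c / np.linalg.norm(c)
    return centroids

# Toy usage: 256 fake "CLIP" embeddings of dim 512, distilled to 8 prototypes.
emb = np.random.default_rng(1).normal(size=(256, 512))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
protos = compute_prototypes(emb, n_prototypes=8)
```

In the paper's full pipeline, each prototype embedding would then be fed to an unCLIP decoder to synthesize a distilled image, so no pixel-space optimization or full-dataset training is needed.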
Problem

Research questions and friction points this paper is trying to address.

- multimodal dataset distillation
- vision-language tasks
- cross-architecture generalization
- data efficiency
- large-scale training
Innovation

Methods, ideas, or system contributions that make the work stand out.

- dataset distillation
- multimodal learning
- prototype-guided synthesis
- CLIP
- cross-architecture generalization