🤖 AI Summary
Existing dataset distillation methods predominantly rely on image features while neglecting semantic information, leading to poor generalization, logical inconsistencies, or missing critical objects. To address this, we propose a vision-language co-distillation framework: for the first time, we leverage open-source large language models (LLMs) to automatically generate class-specific textual descriptions, thereby constructing dual-modality class prototypes—comprising both visual and linguistic representations. Through cross-modal collaborative optimization, our method synthesizes semantically coherent and logically consistent images without requiring ground-truth text annotations. This significantly enhances the semantic expressiveness and generalization capability of small distilled datasets. Our approach achieves state-of-the-art performance across multiple benchmarks, producing images with complete key objects and semantically plausible content. The code and data are publicly available.
📝 Abstract
Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission, and computational costs. However, previous DD methods mainly focus on distilling information from images, often overlooking the semantic information inherent in the data. This disregard for semantic context hinders the model's generalization ability, particularly on complex datasets, and can result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes that distill language information and collaboratively synthesize data together with image prototypes, thereby enhancing dataset distillation performance. Notably, the text prototypes used in this study are derived from descriptive text generated by an open-source large language model. This framework therefore applies broadly to datasets without pre-existing text descriptions, extending dataset distillation beyond traditional image-based approaches. Compared with other methods, the proposed approach generates logically coherent images containing the target objects, achieves state-of-the-art validation performance, and demonstrates robust generalization. Source code and generated data are available at https://github.com/zou-yawen/Dataset-Distillation-via-Vision-Language-Category-Prototype/
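The core idea of the dual-modality prototypes can be illustrated with a minimal sketch: for each class, average the image features into an image prototype and average the embeddings of the LLM-generated class descriptions into a text prototype. This is an assumption-laden toy illustration, not the authors' implementation; the hash-based `toy_text_embed` below merely stands in for a real vision-language text encoder (e.g., CLIP's text tower), and all feature values and descriptions are made up.

```python
import hashlib
import math

DIM = 64  # toy embedding dimension (assumption for illustration)

def toy_text_embed(text):
    """Deterministic toy text encoder: a hypothetical stand-in for a
    real vision-language text encoder such as CLIP's text tower."""
    digest = hashlib.sha256(text.encode()).digest()
    vals = [digest[i % len(digest)] - 127.5 for i in range(DIM)]
    norm = math.sqrt(sum(v * v for v in vals))
    return [v / norm for v in vals]

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def build_class_prototypes(image_feats, class_texts):
    """Dual-modality class prototypes:
    image prototype = mean of the class's image features;
    text prototype  = mean embedding of its LLM-generated descriptions."""
    protos = {}
    for cls, feats in image_feats.items():
        protos[cls] = {
            "image": mean_vec(feats),
            "text": mean_vec([toy_text_embed(t) for t in class_texts[cls]]),
        }
    return protos

# Toy usage with made-up image features and class descriptions
image_feats = {"cat": [[0.1 * (i + j) for i in range(DIM)] for j in range(5)]}
class_texts = {"cat": ["a small furry feline", "a domestic cat with whiskers"]}
protos = build_class_prototypes(image_feats, class_texts)
print(len(protos["cat"]["image"]), len(protos["cat"]["text"]))  # 64 64
```

In the actual framework, these prototypes would then jointly guide image synthesis so that the distilled images stay semantically consistent with the class descriptions; the sketch only shows the prototype-construction step.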