Leveraging Multi-Modal Information to Enhance Dataset Distillation

📅 2025-05-13

📈 Citations: 0

✨ Influential: 0

career value

150K/year

🤖 AI Summary

This work addresses two key challenges in dataset distillation: low semantic fidelity and coarse-grained object representation. To this end, we propose a multimodal object-level distillation framework. Methodologically, we introduce caption-guided supervision—jointly encoding CLIP text embeddings and image features via concatenation and enforcing semantic alignment through a dedicated loss—and design an object-centric masking mechanism that leverages instance segmentation masks to guide feature alignment and gradient matching. The framework enables fine-grained, multimodal co-optimization of distilled data. Evaluated on CIFAR-10/100 and Tiny-ImageNet, it achieves an average 4.2% improvement in downstream task accuracy, accelerates few-shot training convergence by 37%, and significantly enhances the compactness, representativeness, and semantic consistency of synthetic data.

Technology Category

Application Category

📝 Abstract

Dataset distillation aims to create a compact and highly representative synthetic dataset that preserves the knowledge of a larger real dataset. While existing methods primarily focus on optimizing visual representations, incorporating additional modalities and refining object-level information can significantly improve the quality of distilled datasets. In this work, we introduce two key enhancements to dataset distillation: caption-guided supervision and object-centric masking. To integrate textual information, we propose two strategies for leveraging caption features: the feature concatenation, where caption embeddings are fused with visual features at the classification stage, and caption matching, which introduces a caption-based alignment loss during training to ensure semantic coherence between real and synthetic data. Additionally, we apply segmentation masks to isolate target objects and remove background distractions, introducing two loss functions designed for object-centric learning: masked feature alignment loss and masked gradient matching loss. Comprehensive evaluations demonstrate that integrating caption-based guidance and object-centric masking enhances dataset distillation, leading to synthetic datasets that achieve superior performance on downstream tasks.

Problem

Research questions and friction points this paper is trying to address.

Enhancing dataset distillation with multi-modal information

Improving distilled datasets via caption-guided supervision

Refining object-level info with masking for better distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Caption-guided supervision enhances dataset distillation

Object-centric masking improves synthetic dataset quality

Multi-modal fusion boosts semantic coherence in distillation

🔎 Similar Papers

No similar papers found.