Multimodal Distribution Matching for Vision-Language Dataset Distillation

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses two weaknesses of existing vision-language dataset distillation methods: high computational cost, and difficulty preserving cross-modal alignment and the joint image-text distribution. The authors propose a geometry-aware multimodal distribution matching framework that jointly optimizes at the data, model, and loss levels. Specifically, synthetic samples are initialized via clustering in a shared embedding space, mixed supervision signals are generated through teacher model weight interpolation, and a symmetric contrastive objective leveraging the geometric structure of the unit hypersphere is introduced to align directional features across modalities. According to the authors, this approach substantially reduces distillation overhead while producing compact, semantically faithful synthetic datasets that generalize across architectures and retain the original multimodal semantics and alignment performance on multiple image-text retrieval benchmarks.
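As a rough illustration of what a symmetric contrastive objective on the unit hypersphere can look like, the sketch below normalizes paired image and text embeddings and averages the image-to-text and text-to-image cross-entropy terms. It is a generic CLIP-style loss written in plain Python, not the paper's actual objective; the function names and the `temperature` default are assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses.

    Assumption: pair i of image_embs matches pair i of text_embs.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

Because both directions are averaged, swapping the image and text arguments leaves the value unchanged, and correctly paired embeddings score lower than mismatched ones.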

📝 Abstract

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook cross-modal correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features along the cross-modal agreement and discrepancy directions, combined with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
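The model-level step can be pictured with a small sketch of weight-space interpolation. The abstract does not say how angular deviation maps to interpolation coefficients, so the rule here (coefficients proportional to each fine-tuned model's angle from the pretrained anchor, uniform if all angles are zero) is purely an assumption for illustration; the opposite weighting would be equally plausible. Models are represented as flat lists of floats, and `angle_between` and `mix_teachers` are hypothetical names.

```python
import math


def angle_between(u, v):
    """Angle in radians between two flattened weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.acos(cos)


def mix_teachers(anchor, finetuned):
    """Interpolate fine-tuned weight vectors into one mixed teacher.

    Assumption (not from the paper): each coefficient is proportional to the
    model's angular deviation from the pretrained anchor.
    """
    angles = [angle_between(w, anchor) for w in finetuned]
    total = sum(angles)
    if total == 0.0:
        # All models point in the anchor's direction: fall back to a plain average.
        coeffs = [1.0 / len(finetuned)] * len(finetuned)
    else:
        coeffs = [a / total for a in angles]
    mixed = [
        sum(c * w[j] for c, w in zip(coeffs, finetuned))
        for j in range(len(anchor))
    ]
    return mixed, coeffs
```

The coefficients always sum to one, so the mixed teacher stays inside the convex hull of the fine-tuned models.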

Problem

Research questions and friction points this paper is trying to address.

multimodal distillation

vision-language

dataset distillation

cross-modal alignment

distribution matching

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal distillation

geometry-aware matching

synthetic dataset