MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training

📅 2026-02-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high cost, manual tuning, and limited scalability of multimodal data mixing in vision-language model training by proposing MaD-Mix, a framework that formulates data mixing as a modality-aware domain alignment maximization problem. By introducing cross-modal coupling variables and leveraging Fenchel duality to derive a closed-form multimodal alignment score, MaD-Mix enables fully automatic mixing strategies without human intervention. Notably, it is the first method to uniformly support both missing-modality scenarios (e.g., text-only) and complex trimodal settings (video–image–text). Experiments demonstrate that MaD-Mix achieves performance on par with manually tuned baselines using only 78% of the training steps on both 0.5B and 7B models, significantly outperforms uniform mixing in trimodal tasks, and incurs less than one GPU-hour of computational overhead.

๐Ÿ“ Abstract
Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture computation overhead (<1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.
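The abstract describes deriving closed-form alignment scores per domain (averaging over whichever modalities a domain actually has, so language-only domains integrate naturally) and turning them into mixture weights. The sketch below illustrates that overall shape only: it is not the paper's algorithm. The cosine-similarity alignment proxy, the softmax weighting, and all embeddings are assumptions standing in for the Fenchel-dual scores the paper derives.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    # Cosine similarity as a stand-in alignment measure (assumption,
    # not the paper's Fenchel-dual closed-form score).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-domain embeddings for two modalities.
# `None` marks a missing modality (e.g., a language-only domain).
domains = {
    "captions":  {"image": rng.normal(size=16), "text": rng.normal(size=16)},
    "vqa":       {"image": rng.normal(size=16), "text": rng.normal(size=16)},
    "text_only": {"image": None,                "text": rng.normal(size=16)},
}
# Hypothetical embedding of the target (downstream) distribution.
target = {"image": rng.normal(size=16), "text": rng.normal(size=16)}

def alignment(domain):
    # Average alignment over only the modalities the domain actually has,
    # so missing-modality domains are handled uniformly.
    scores = [cosine(v, target[m]) for m, v in domain.items() if v is not None]
    return sum(scores) / len(scores)

scores = np.array([alignment(d) for d in domains.values()])
# Softmax turns per-domain alignment scores into sampling weights.
weights = np.exp(scores) / np.exp(scores).sum()

for name, w in zip(domains, weights):
    print(f"{name}: {w:.3f}")
```

The resulting `weights` would drive how often each domain is sampled during training; note that the text-only domain still receives a weight despite lacking an image embedding.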
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Multi-Modal Data Mixtures
Data Mixing
Modality Alignment
Manual Tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal data mixing
latent space coupling
modality-aware alignment
vision-language models
Fenchel dual optimization