EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

📅 2026-03-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a key limitation in existing diffusion-based dataset distillation methods, which typically integrate text prompts at late generation stages, thereby suppressing visual features and producing synthetic samples overly reliant on textual cues at the expense of photorealism. To overcome this, the authors propose an early vision-language fusion mechanism that aligns image and text embeddings via lightweight cross-attention between the encoder and the generative backbone, jointly preserving fine-grained textures and global semantics. Notably, this approach introduces cross-modal fusion at an early stage of the diffusion process for the first time, enabling a plug-and-play, task-agnostic enhancement compatible with diverse denoising architectures and sampling strategies. Experiments demonstrate that the generated data significantly outperform prior methods in both semantic fidelity and visual coherence, consistently boosting performance across multiple downstream classification tasks.

๐Ÿ“ Abstract
Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Source code is available at https://github.com/wenqi-cai297/earlyfusion-for-dd/.
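The fusion step described in the abstract can be illustrated as a single cross-attention pass in which the encoder's visual latents act as queries and the prompt embedding supplies keys and values, with a residual connection preserving the visual features. This is a minimal NumPy sketch under assumed shapes and names (`early_fusion`, the `w_q`/`w_k`/`w_v` projections); it is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def early_fusion(visual_latents, text_emb, w_q, w_k, w_v):
    """Fuse a text embedding into visual latents via one cross-attention pass.

    visual_latents: (n_patches, d)   encoder output, used as queries
    text_emb:       (n_tokens, d_t)  prompt embedding, used as keys/values
    w_q, w_k, w_v:  illustrative projection matrices (hypothetical names)
    """
    q = visual_latents @ w_q                         # (n_patches, d_k)
    k = text_emb @ w_k                               # (n_tokens, d_k)
    v = text_emb @ w_v                               # (n_tokens, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_patches, n_tokens)
    # Residual connection: textual guidance is added to, not substituted
    # for, the visual latents, so local textures are retained.
    return visual_latents + attn @ v
```

In a full pipeline this block would sit between the encoder and the denoising backbone, so every subsequent denoising step already sees latents carrying both texture and semantic direction; the residual form is one plausible way to keep the text signal from dominating, matching the paper's motivation.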
Problem

Research questions and friction points this paper is trying to address.

dataset distillation
diffusion models
vision-language fusion
textual prompts
visual latents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early Vision-Language Fusion
Dataset Distillation
Diffusion Models
Cross-Attention
Synthetic Data Generation