Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision-language models tend to under-represent long-tail semantics because data sampling during pre-training is biased toward head concepts, and efficient pre-training methods can remove rare concepts disproportionately. To address this, the work proposes dynamic cluster-based sampling (DynamiCS), which re-applies sampling at every training epoch according to semantic cluster sizes: large clusters are downsampled and small ones are upsampled. Unlike existing approaches that only flatten the semantic distribution, DynamiCS preserves the relative order of cluster sizes while emphasizing the long tail. According to the authors' experiments, it reduces the computational cost of VLM training and improves performance on long-tail concepts.

📝 Abstract

The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, long-tail concepts remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a dynamic cluster-based sampling approach (DynamiCS) that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it applies sampling at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long-tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts.
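
The abstract does not give the scaling rule, so the following is only a rough sketch of what per-epoch, cluster-scaled sampling could look like, not the authors' implementation. It assumes each cluster's sample count is proportional to its size raised to a power `alpha` between 0 and 1, which shrinks large clusters, grows small ones, and keeps their relative order. The function names, the `budget` parameter, the `alpha` exponent, and the use of sampling with replacement for upsampling are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def cluster_scaled_targets(cluster_sizes, budget, alpha=0.5):
    """Map each cluster to a sample count proportional to size ** alpha.

    Assumption: a power-law rule with 0 < alpha < 1. It is monotonic, so
    larger clusters still receive more samples than smaller ones.
    """
    weights = {c: n ** alpha for c, n in cluster_sizes.items()}
    total = sum(weights.values())
    return {c: max(1, round(budget * w / total)) for c, w in weights.items()}


def sample_epoch(cluster_ids, budget, alpha=0.5, rng=None):
    """Draw the example indices for one epoch; call again for the next epoch."""
    rng = rng or random.Random()
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)
    sizes = {c: len(m) for c, m in members.items()}
    targets = cluster_scaled_targets(sizes, budget, alpha)

    chosen = []
    for cluster, target in targets.items():
        pool = members[cluster]
        if target <= len(pool):
            # Large cluster: downsample without replacement.
            chosen.extend(rng.sample(pool, target))
        else:
            # Small cluster: upsample by drawing with replacement.
            chosen.extend(rng.choices(pool, k=target))
    rng.shuffle(chosen)
    return chosen


# Toy corpus: one head cluster, one mid-sized cluster, one long-tail cluster.
cluster_ids = ["head"] * 900 + ["mid"] * 90 + ["tail"] * 10
rng = random.Random(0)
for epoch in range(3):
    epoch_indices = sample_epoch(cluster_ids, budget=300, rng=rng)
    # A different subset is drawn every epoch, which is the "dynamic" part.
```

With these toy sizes and `alpha=0.5`, the head cluster drops from 900 examples to 211 per epoch, while the tail cluster rises from 10 to 22, so the head remains the largest cluster even though the tail gains weight. A purely flattening scheme would instead assign all three clusters the same count.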

Problem

Research questions and friction points this paper is trying to address.

vision-language pre-training

long-tail concepts

data sampling

semantic balance

computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic sampling

cluster-based sampling

long-tail learning