Concept-Aware Batch Sampling Improves Language-Image Pretraining

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing data filtering methods for vision-language models suffer from two key limitations: offline static selection and concept-agnostic sampling, which lead to dataset bias and poor task adaptability. To address this, we propose Concept-Aware Batch Sampling (CABS), the first online, task-adaptive framework that dynamically constructs training batches aligned with target concept distributions. CABS builds on DataConcept, a large-scale dataset of 128M image-text pairs annotated with fine-grained concept labels, enabling concept-aware, on-the-fly batch construction. It integrates two complementary sampling strategies, diversity maximization and frequency maximization, allowing flexible, controllable, and low-bias online data selection. Extensive evaluation across 28 downstream benchmarks demonstrates consistent and significant performance gains for CLIP- and SigLIP-based models, validating both effectiveness and generalizability. All code, DataConcept, and pretrained models are publicly released, providing a high-quality, customizable alternative for vision-language pretraining.

📝 Abstract
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Existing vision-language training uses offline concept-agnostic data curation methods
Current approaches create static datasets with predetermined filtering criteria
Model-based filters in existing methods introduce additional data biases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online concept-based batch sampling framework
Dynamic curation using target concept distributions
Two sampling variants: diversity maximization (CABS-DM) for broad concept coverage and frequency maximization (CABS-FM) for high object multiplicity
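The two sampling strategies above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's released implementation: it assumes each pool sample carries a `concepts` list (the DataConcept-style annotation), and the function names and greedy scoring are placeholders for the actual CABS algorithms.

```python
def cabs_dm_batch(pool, batch_size):
    """Sketch of Diversity Maximization (CABS-DM): greedily pick the
    sample that contributes the most concepts not yet in the batch."""
    covered = set()          # concepts already represented in the batch
    batch = []
    candidates = list(pool)
    for _ in range(min(batch_size, len(candidates))):
        # score each candidate by how many *new* concepts it adds
        best = max(candidates, key=lambda s: len(set(s["concepts"]) - covered))
        batch.append(best)
        covered |= set(best["concepts"])
        candidates.remove(best)
    return batch

def cabs_fm_batch(pool, batch_size):
    """Sketch of Frequency Maximization (CABS-FM): prefer samples
    whose annotations contain the most concept mentions."""
    return sorted(pool, key=lambda s: len(s["concepts"]), reverse=True)[:batch_size]
```

In practice such selection would run online against a target concept distribution rather than a static pool; the sketch only shows how the two objectives differ at the batch level.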
🔎 Similar Papers
2024-04-30 · International Conference on Machine Learning · Citations: 13