20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work asks how far data curation alone can improve vision-language model (VLM) performance when the model architecture, training recipe, and compute budget are held fixed. Starting from the single-image subset of MAmmoTH-VL, we build a data curation pipeline and evaluate it on 1B to 4B models using 20 public benchmarks and DatBench, our high-fidelity evaluation suite. Changing only the training data yields an average gain of 11.7 percentage points across the 20 public benchmarks and 11.3 percentage points across the nine capability axes of DatBench. At 2B, the curated model comes within 1.8 percentage points of Qwen3-VL-2B while using roughly 87x less training compute. The curated models also show lower variance across training seeds, stronger out-of-distribution generalization, and more honest and specific responses than the matched-compute baseline. At every scale they raise accuracy while lowering response FLOPs, a Pareto improvement in both performance and inference cost.

📝 Abstract

Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond accuracy, curation delivers four further properties: (1) Reliability: per-capability std across training seeds drops by ~67% and the lift survives a 4k-to-16k context-length sweep; (2) OOD generalization: the 9-eval OOD average rises by +7.2pp, and multi-image BLINK rises by +3.09pp despite single-image-only training, with Visual Correspondence gaining +11.8pp; (3) Behavioral gains beyond benchmarks: across ~1,100 open-ended queries the curated 2B is more honest and more specific than the matched-compute baseline, and more concise and less refusal-prone than a frontier 2B reference; (4) Pareto-dominance on inference cost: at every scale (1B, 2B, 4B) the curated model raises accuracy while lowering response FLOPs vs. the matched-compute baseline, and the curated 4B matches near-frontier accuracy at 3.3x lower response FLOPs than Qwen3-VL-4B. Data curation is a high-leverage tool for building better VLMs, reaching near-frontier accuracy at up to ~150x less training compute.
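The abstract leans on two quantities: an average percentage-point (pp) lift across benchmarks, and Pareto-dominance of accuracy against response FLOPs. The sketch below shows one plausible way to compute both. It is not the authors' evaluation code: the names `ModelPoint`, `mean_pp_lift`, and `pareto_dominates` are hypothetical, and the numbers in the usage lines are placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ModelPoint:
    """One model summarized by two axes (hypothetical container)."""
    accuracy: float        # average benchmark accuracy, in percent
    response_flops: float  # average inference FLOPs per response


def mean_pp_lift(curated: dict[str, float], baseline: dict[str, float]) -> float:
    """Average percentage-point gain over the benchmarks both models share."""
    shared = curated.keys() & baseline.keys()
    if not shared:
        raise ValueError("no shared benchmarks to compare")
    return mean(curated[name] - baseline[name] for name in shared)


def pareto_dominates(a: ModelPoint, b: ModelPoint) -> bool:
    """True if `a` is no worse than `b` on both axes and better on at least one.

    Higher accuracy is better; lower response FLOPs is better.
    """
    no_worse = a.accuracy >= b.accuracy and a.response_flops <= b.response_flops
    strictly_better = a.accuracy > b.accuracy or a.response_flops < b.response_flops
    return no_worse and strictly_better


if __name__ == "__main__":
    # Placeholder scores for illustration only; not taken from the paper.
    baseline_scores = {"bench_a": 50.0, "bench_b": 45.0}
    curated_scores = {"bench_a": 60.0, "bench_b": 50.0}
    print(f"mean lift: {mean_pp_lift(curated_scores, baseline_scores):+.1f}pp")

    curated = ModelPoint(accuracy=70.0, response_flops=1.0e12)
    baseline = ModelPoint(accuracy=60.0, response_flops=2.0e12)
    print("curated dominates baseline:", pareto_dominates(curated, baseline))
```

Requiring a strict improvement on at least one axis keeps two identical models from "dominating" each other, which matches the usual definition of Pareto dominance.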

Problem

Research questions and friction points this paper is trying to address.

vision-language models

data curation

model performance

training data

benchmark evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

data curation

vision-language models

compute efficiency