FineVision: Open Data Is All You Need

📅 2025-10-20
🤖 AI Summary
Vision-language models (VLMs) are held back by fragmented, inconsistent, and contaminated public vision-language datasets. FineVision is a large-scale vision-language corpus of 24 million samples, unified from more than 200 sources into 185 subsets. It is built with a semi-automated, human-in-the-loop pipeline: automation handles bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs for faithful use of annotations, formatting, diversity, and safety. The pipeline also de-duplicates within and across sources and decontaminates against 66 public benchmarks. For agentic/GUI tasks, FineVision provides a unified action space, and reviewers inspect a sample of trajectories to confirm they are executable. Models trained on FineVision consistently outperform those trained on existing open data mixtures across a broad evaluation suite, which the authors attribute to scale, data hygiene, and a balance of automation with human oversight.

📝 Abstract
The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
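
The ingestion-and-audit loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `UnifiedSample` schema, the two mapper functions, the raw field names, and the `audit` checks are hypothetical and are not taken from the FineVision tooling.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedSample:
    """Hypothetical unified record: one image reference plus dialogue turns."""
    source: str
    image_path: str
    turns: list = field(default_factory=list)  # [{"user": ..., "assistant": ...}]


# Hypothetical per-source mappers; each real source would need its own.
def map_vqa(record: dict) -> UnifiedSample:
    return UnifiedSample(
        source="vqa_like",
        image_path=record["img"],
        turns=[{"user": record["question"], "assistant": record["answer"]}],
    )


def map_caption(record: dict) -> UnifiedSample:
    return UnifiedSample(
        source="caption_like",
        image_path=record["file"],
        turns=[{"user": "Describe the image.", "assistant": record["caption"]}],
    )


def audit(sample: UnifiedSample) -> list:
    """Return the problems a reviewer would want flagged for a fix and re-run."""
    problems = []
    if not sample.image_path:
        problems.append("missing image")
    if not sample.turns:
        problems.append("no turns")
    for turn in sample.turns:
        if not turn["user"].strip() or not turn["assistant"].strip():
            problems.append("empty turn text")
    return problems


samples = [
    map_vqa({"img": "a.jpg", "question": "What color is the car?", "answer": "Red"}),
    map_caption({"file": "b.jpg", "caption": "  "}),  # whitespace-only caption
]
flagged = [(s.source, audit(s)) for s in samples if audit(s)]
print(flagged)
```

In this toy run only the caption-style record is flagged, because its annotation is blank after mapping. An automated check of this kind would complement, not replace, the human spot-checks the abstract describes.
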
Problem

Research questions and friction points this paper is trying to address.

Address fragmented vision-language datasets with contamination issues
Unify diverse data sources through semi-automated curation pipeline
Enhance model performance via scaled hygienic data with human oversight
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automated pipeline unifies more than 200 sources into 185 subsets
Human reviewers audit mappings and verify outputs
Rigorous deduplication within and across sources, and decontamination against 66 public benchmarks
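
A minimal sketch of cross-source deduplication and benchmark decontamination follows. It assumes exact byte-level hashing of images as the matching criterion, which is a deliberate simplification: the source does not state how FineVision detects duplicates, and a real pipeline would more plausibly need near-duplicate matching. The function names and record layout are hypothetical.

```python
import hashlib


def content_key(image_bytes: bytes) -> str:
    """Exact-match key for an image (assumption: byte-identical duplicates only)."""
    return hashlib.sha256(image_bytes).hexdigest()


def dedup_and_decontaminate(samples, benchmark_images):
    """Drop samples that match a benchmark image or repeat an earlier sample."""
    benchmark_keys = {content_key(b) for b in benchmark_images}
    seen = set()
    kept = []
    stats = {"duplicates": 0, "contaminated": 0}
    for sample in samples:
        key = content_key(sample["image"])
        if key in benchmark_keys:
            stats["contaminated"] += 1  # overlaps with an evaluation set
            continue
        if key in seen:
            stats["duplicates"] += 1  # already kept, possibly from another source
            continue
        seen.add(key)
        kept.append(sample)
    return kept, stats


samples = [
    {"source": "A", "image": b"cat"},
    {"source": "B", "image": b"cat"},         # duplicate across sources
    {"source": "B", "image": b"dog"},
    {"source": "C", "image": b"test-image"},  # also present in a benchmark
]
kept, stats = dedup_and_decontaminate(samples, benchmark_images=[b"test-image"])
print([s["source"] for s in kept], stats)
```

Checking the benchmark set before the seen set means a contaminated image is always counted as contamination, even if it also repeats across sources.
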
Luis Wiedmann
Hugging Face, Technical University Munich
Orr Zohar
Stanford University
Large Multimodal Models, Foundation Models, Vision-Language Models
Amir Mahla
Stanford University
Xiaohan Wang
Hugging Face
Rui Li
Hugging Face
Thibaud Frere
Hugging Face
Leandro von Werra
Hugging Face
Aritra Roy Gosthipaty
Hugging Face
Andrés Marafioti
Hugging Face
machine learning, audio generation, computer vision