FineVision: Open Data Is All You Need

📅 2025-10-20

🤖 AI Summary

Vision-language models (VLMs) are held back by fragmented, inconsistent, and contaminated public vision-language datasets. FineVision addresses this with a unified corpus of 24 million samples, built from more than 200 sources organized into 185 subsets. Curation is semi-automated with humans in the loop: automation handles bulk ingestion and schema mapping, while reviewers audit the mappings and spot-check outputs for faithful use of annotations, formatting, diversity, and safety. The pipeline also de-duplicates within and across sources and decontaminates against 66 public benchmarks. The corpus includes agentic/GUI tasks under a unified action space, with reviewers inspecting a sample of trajectories to confirm that they are executable. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, which the authors attribute to scale, data hygiene, and a balance of automation with human oversight.

📝 Abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples, the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
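
To make the ingestion step concrete, here is a minimal Python sketch of schema mapping followed by a reviewer spot-check sample. It is an illustration under stated assumptions, not the authors' released tooling: the `UnifiedSample` fields, the `FIELD_MAPS` table, the source names, and the default prompt are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class UnifiedSample:
    """Hypothetical unified record; the real FineVision schema may differ."""
    source: str
    image_path: str
    question: str
    answer: str


# Hypothetical per-source mappings: unified field -> field name in the source.
FIELD_MAPS = {
    "vqa_like": {"image_path": "img", "question": "q", "answer": "a"},
    "caption_like": {"image_path": "file", "answer": "caption"},
}

# Assumed fallback prompt for sources that carry no question field.
DEFAULT_QUESTION = "Describe the image."


def map_record(source, record):
    """Automated step: convert one raw record into the unified schema."""
    mapping = FIELD_MAPS[source]
    if "question" in mapping:
        question = record[mapping["question"]]
    else:
        question = DEFAULT_QUESTION
    return UnifiedSample(
        source=source,
        image_path=record[mapping["image_path"]],
        question=question,
        answer=record[mapping["answer"]],
    )


def spot_check_sample(samples, k, seed=0):
    """Human step: draw a reproducible random subset for reviewers to audit."""
    rng = random.Random(seed)
    return rng.sample(samples, min(k, len(samples)))
```

In this sketch, a reviewer who finds a faulty mapping would correct the relevant `FIELD_MAPS` entry and re-run `map_record` for that source only, which mirrors the "targeted fixes and re-runs" loop described in the abstract.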

Problem

Research questions and friction points this paper is trying to address.

Public vision-language datasets are fragmented, inconsistent, and contaminated

Unifying more than 200 heterogeneous sources requires a semi-automated curation pipeline

Whether larger, cleaner data curated with human oversight improves model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automated pipeline unifies more than 200 sources into 185 subsets

Human reviewers audit schema mappings and spot-check outputs

Rigorous de-duplication within and across sources, plus decontamination against 66 public benchmarks
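
The de-duplication and decontamination step can be sketched in a few lines of Python. This summary does not say how the authors match images, so the sketch substitutes exact SHA-256 hashing of the image bytes as a stand-in. The function names and the `(sample_id, image_bytes)` input format are assumptions for illustration.

```python
import hashlib


def content_hash(image_bytes: bytes) -> str:
    """Exact content fingerprint; only catches byte-identical images."""
    return hashlib.sha256(image_bytes).hexdigest()


def dedup_and_decontaminate(samples, benchmark_images):
    """Return IDs of samples that are neither duplicates nor benchmark images.

    samples: iterable of (sample_id, image_bytes) pairs pooled across sources.
    benchmark_images: iterable of image bytes drawn from evaluation benchmarks.
    """
    benchmark_hashes = {content_hash(b) for b in benchmark_images}
    seen = set()
    kept = []
    for sample_id, image_bytes in samples:
        h = content_hash(image_bytes)
        if h in benchmark_hashes:
            continue  # decontamination: image also appears in a benchmark
        if h in seen:
            continue  # de-duplication: image already kept from some source
        seen.add(h)
        kept.append(sample_id)
    return kept
```

Exact hashing misses resized or re-encoded copies. A pipeline at this scale would more plausibly compare perceptual hashes or learned image embeddings, but the control flow of filtering against a benchmark set and a running "seen" set stays the same.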

🔎 Similar Papers

Ethical Challenges in Computer Vision: Ensuring Privacy and Mitigating Bias in Publicly Available Datasets