Data Metabolism: An Efficient Data Design Schema For Vision Language Model

📅 2025-04-10
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
To address data inefficiency and model redundancy in vision-language model (VLM) training, this work proposes the "Data Metabolism" paradigm, a data-centric, full-lifecycle VLM development framework. Methodologically, it integrates multi-stage data cleaning and augmentation, task-aware data composition, user-driven data-flywheel feedback loops, and lightweight architecture fine-tuning with rigorous evaluation. Its core contribution lies in elevating data governance to a dynamic, self-optimizing system that continuously evolves through data curation, iterative refinement, and customized feedback. Experiments demonstrate that the released model, Capybara-VL, though roughly one-tenth the size of several open-source VLMs it surpasses, achieves results on par with leading proprietary models on visual question answering, scientific reasoning, and text-rich tasks. This significantly improves training efficiency and deployment feasibility without sacrificing capability.

📝 Abstract
Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework for building VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps: data curation and iteration, forming a closed-loop system that continuously improves model performance. We show a detailed codebook on how to process existing massive datasets and build a user-specific data flywheel. As a demonstration, we release a VLM, named Capybara-VL, which excels in typical multimodal tasks (e.g., visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger. Moreover, it achieves results that are on par with those of several leading proprietary models, demonstrating its remarkable competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.
Problem

Research questions and friction points this paper is trying to address.

Improving VLM performance through data curation and iteration
Designing efficient data processing for smaller, competitive VLMs
Building a closed-loop system for continuous model enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Metabolism framework for VLM lifecycle
Closed-loop data curation and iteration system
User-specific data flywheel construction method
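The closed-loop curation-and-iteration idea above can be sketched as a single flywheel step: evaluate the model, harvest its failures, curate them, and fold them back into the training pool. The sketch below is a minimal, hypothetical illustration of that loop; all names (`evaluate`, `harvest_failures`, `curate`, `flywheel_step`) and the toy "model" are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of one "data flywheel" iteration:
# evaluate -> harvest weak cases -> curate -> extend the training pool.

def evaluate(model, eval_set):
    """Score each sample; return (sample, correct) pairs."""
    return [(s, model(s["question"]) == s["answer"]) for s in eval_set]

def harvest_failures(results):
    """Keep samples the model got wrong; these drive the next curation round."""
    return [s for s, ok in results if not ok]

def curate(failures):
    """Toy 'curation': deduplicate failures and tag their provenance."""
    seen, cleaned = set(), []
    for s in failures:
        if s["question"] not in seen:
            seen.add(s["question"])
            cleaned.append({**s, "source": "flywheel"})
    return cleaned

def flywheel_step(model, train_pool, eval_set):
    """One closed-loop iteration over the current model and data pool."""
    new_data = curate(harvest_failures(evaluate(model, eval_set)))
    return train_pool + new_data

# Usage with a stub "model" that only knows one answer.
model = lambda q: {"2+2?": "4"}.get(q, "?")
eval_set = [
    {"question": "2+2?", "answer": "4"},
    {"question": "capital of France?", "answer": "Paris"},
    {"question": "capital of France?", "answer": "Paris"},
]
pool = flywheel_step(model, [], eval_set)
# The two duplicate failures collapse into one curated training sample.
```

In the paper's framing, this loop would run continuously as user-specific data arrives, so the pool evolves with the model rather than being fixed up front.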
Jingyuan Zhang
Kuaishou Technology
Hongzhi Zhang
Professor of Computer Science and Technology, Harbin Institute of Technology
Deep LearningArtificial IntelligenceComputer Vision
Haonan Zhou
HKU Business School
Chenxi Sun
Kuaishou Technology
Xingguang Ji
Kuaishou Technology
Jiakang Wang
Kuaishou Technology
Fanheng Kong
Northeastern University; Kuaishou Technology
Multimodal LLMMultimodal Understanding
Yahui Liu
Kuaishou Technology
Qi Wang
Kuaishou Technology
Fuzheng Zhang
Kuaishou Technology