Unicorn: Text-Only Data Synthesis for Vision Language Model Training

📅 2025-03-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language model (VLM) training relies heavily on large-scale, manually collected image-text pairs, which are costly to acquire and hard to scale. To address this, the paper proposes the first purely text-driven, three-stage multimodal data synthesis framework. Starting from sparse caption seeds, it applies LLM-guided caption expansion, iterative instruction construction, and cross-modal representation transfer (from textual to visual embeddings) to automatically generate 1.2 million image-text pairs (Unicorn-1.2M), in which the image side is a synthetic visual representation rather than a real image, and 471K multi-turn instruction-tuning samples (Unicorn-471K-Instruction). The approach removes the need for real images while maintaining diversity and quality in the synthetic data. Evaluated across multiple benchmarks, models trained exclusively on this synthetic data match or approach the performance of models trained on real-image datasets. The work offers a cost-effective, scalable approach to VLM data curation that reduces reliance on expensive image collection and manual annotation.

📝 Abstract

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse, high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, the textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training. Code is available at https://github.com/Yu-xm/Unicorn.git.
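The three stages in the abstract can be read as a simple data flow. The sketch below is only an illustration of that flow under stated assumptions, not the authors' implementation: `synthesize`, `SyntheticSample`, the three callables, and the "first N captions" rule for choosing the instruction-tuning subset are all hypothetical, and the toy stand-ins replace the LLM and encoder that a real pipeline would call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SyntheticSample:
    caption: str
    # (question, answer) turns; empty when the caption is used for pretraining only.
    instruction_turns: List[Tuple[str, str]]
    # Synthetic "image" representation derived from the caption text alone.
    visual_repr: List[float]


def synthesize(
    seeds: List[str],
    expand: Callable[[str], List[str]],                    # Stage 1 (assumed: an LLM call)
    make_dialogue: Callable[[str], List[Tuple[str, str]]], # Stage 2 (assumed: an LLM call)
    text_to_visual: Callable[[str], List[float]],          # Stage 3 (assumed: encode + transfer)
    instruct_subset_size: int,
) -> List[SyntheticSample]:
    # Stage 1: expand each sparse seed into several diverse captions.
    captions = [c for seed in seeds for c in expand(seed)]

    samples = []
    for i, caption in enumerate(captions):
        # Stage 2: only a subset becomes multi-turn instruction data.
        # Taking the first N captions is an arbitrary choice for this sketch.
        turns = make_dialogue(caption) if i < instruct_subset_size else []
        # Stage 3: replace the missing image with a representation built from text.
        samples.append(SyntheticSample(caption, turns, text_to_visual(caption)))
    return samples


# Toy stand-ins so the sketch runs without any model.
def toy_expand(seed: str) -> List[str]:
    return [f"{seed}, seen at dawn", f"{seed}, seen at night"]


def toy_dialogue(caption: str) -> List[Tuple[str, str]]:
    return [("What is described?", caption)]


def toy_encode(caption: str) -> List[float]:
    return [float(len(caption)), float(caption.count(" "))]


samples = synthesize(
    ["a dog on a beach", "a red bus"],
    toy_expand, toy_dialogue, toy_encode,
    instruct_subset_size=1,
)
```

Passing the stages in as callables keeps the flow independent of any particular LLM or encoder, which the abstract does not name.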

Problem

Research questions and friction points this paper is trying to address.

Synthesizing multimodal training data from text-only sources

Generating diverse image-text pairs without real images

Enabling cost-effective vision-language model training

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-expanded diverse caption synthesis

Multi-turn instruction-tuning task generation

Text-to-visual representation transfer
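The page does not say how text representations are moved into the visual space. One simple technique that fits the description is a mean shift between the two embedding distributions: subtract the mean of the text embeddings, add a mean for the visual space, and renormalize. The sketch below assumes that technique purely for illustration. The function names are hypothetical, and `image_mean` is an assumed precomputed statistic; the source does not state whether or how such a statistic is obtained.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def transfer_to_visual(text_embs: List[Vector], image_mean: Vector) -> List[Vector]:
    """Mean-shift text embeddings toward the visual space (assumed technique).

    image_mean is a hypothetical input: an estimate of the mean embedding
    of the target visual space.
    """
    text_mean = mean_vector(text_embs)
    shifted = [
        [x - t + m for x, t, m in zip(v, text_mean, image_mean)]
        for v in text_embs
    ]
    return [l2_normalize(v) for v in shifted]


# Two toy 2-D text embeddings, shifted toward a visual mean at the origin.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
visual_like = transfer_to_visual(text_embs, image_mean=[0.0, 0.0])
```

The shift preserves the relative geometry of the captions (differences between embeddings are unchanged before normalization), so the diversity of the text side carries over to the synthetic visual side.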
