🤖 AI Summary
Current vision-language models (VLMs) rely heavily on large-scale real image-text pairs, facing bottlenecks in data-acquisition efficiency, inconsistent data quality, and privacy risk. To address this, we propose SynthVLM, the first framework to adopt a “text-to-image reverse synthesis” paradigm: rather than captioning existing images, it leverages advanced diffusion models (e.g., SDXL) to generate high-fidelity synthetic images that are precisely aligned with high-quality captions. We construct SynthVLM-100K, the first 100K-scale synthetic dataset rigorously validated by both human annotators and automated models, using a hybrid curation pipeline that combines multi-stage automated filtering with human verification, and pair it with an end-to-end pretraining framework for multimodal large language models (MLLMs). Experiments show that SynthVLM-100K outperforms comparable real-world datasets on VQA benchmarks; the derived models, SynthVLM-7B and SynthVLM-13B, surpass LLaVA on most metrics using only 18% of its pretraining data and achieve state-of-the-art performance on MMLU, confirming that high-quality synthetic data preserves linguistic understanding and cross-modal generalization.
📝 Abstract
Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which brings challenges related to the efficiency, effectiveness, quality, and privacy of web data. In this paper, we introduce SynthVLM, a novel data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to automatically synthesize and select high-resolution images from text descriptions, thereby creating precisely aligned image-text pairs. To demonstrate the power of SynthVLM, we introduce SynthVLM-100K, a high-quality dataset consisting of 100,000 curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of multimodal large language models (MLLMs), SynthVLM-7B and SynthVLM-13B, which achieve state-of-the-art (SOTA) performance on various visual question-answering (VQA) tasks. Notably, our models outperform LLaVA across most metrics with only 18% of the pretraining data. Furthermore, SynthVLM-7B and SynthVLM-13B attain SOTA performance on the MMLU benchmark, demonstrating that the high-quality SynthVLM-100K dataset preserves language abilities. To facilitate future research, our dataset and the complete data generation and curation methods are open-sourced at https://github.com/starriver030515/SynthVLM.
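The abstract describes a two-step recipe: synthesize candidate images from captions with a diffusion model, then select only the best-aligned image-caption pairs. The sketch below is a minimal, hypothetical illustration of that selection logic, not the authors' released pipeline: `generate` and `score` are stand-ins (in practice, `generate` might wrap an SDXL call and `score` a CLIP-style alignment metric), and the function names and parameters are our assumptions for illustration.

```python
# Hypothetical sketch of a synthesize-and-select curation step.
# `generate` and `score` are placeholders for a diffusion model
# (e.g., SDXL) and an image-text alignment metric (e.g., CLIPScore).
import heapq
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Pair:
    caption: str
    image_id: str  # handle to a generated image, e.g., a file path
    score: float   # caption-image alignment score


def curate_pairs(
    captions: Iterable[str],
    generate: Callable[[str, int], List[str]],  # caption -> K candidate image ids
    score: Callable[[str, str], float],         # (caption, image id) -> alignment
    candidates_per_caption: int = 4,
    keep_top: int = 100,
) -> List[Pair]:
    """For each caption, synthesize several candidates and keep the
    best-aligned one; then retain the globally top-scoring pairs."""
    best: List[Pair] = []
    for cap in captions:
        ranked = [
            Pair(cap, img, score(cap, img))
            for img in generate(cap, candidates_per_caption)
        ]
        best.append(max(ranked, key=lambda p: p.score))
    # Global selection: keep only the highest-alignment pairs.
    return heapq.nlargest(keep_top, best, key=lambda p: p.score)
```

With real models plugged in, the per-caption selection discards diffusion failures (artifacts, semantic drift), and the global top-k pass enforces an overall quality bar on the dataset.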