VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

📅 2026-04-10

🤖 AI Summary
Current vision-language models (VLMs) perform poorly on visual perception tasks such as spatial understanding and viewpoint recognition; one plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. To address this, the work proposes VisionFoundry, a fully automated pipeline that builds synthetic visual perception data from only a task name. It uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, synthesizes the corresponding images with T2I models, and verifies image-text consistency with a proprietary VLM, requiring no reference images or human annotation. Models trained on the resulting VisionFoundry-10K dataset (10k image-question-answer triples spanning 10 tasks) improve by 7% on MMVP and 10% on CV-Bench-3D while preserving broader capabilities and showing favorable scaling behavior as data size increases.

📝 Abstract
Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
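
The generate, synthesize, verify loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Draft`, `Sample`, `build_dataset`, and `make_stubs` are hypothetical, the three model calls are passed in as plain callables, and the stubs stand in for a real LLM, T2I model, and verifier VLM.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class Draft:
    """LLM output for one candidate item (hypothetical structure)."""
    question: str
    answer: str
    t2i_prompt: str


@dataclass
class Sample:
    """A verified image-question-answer triple."""
    task: str
    question: str
    answer: str
    t2i_prompt: str
    image: bytes


def build_dataset(
    task: str,
    n_target: int,
    propose: Callable[[str], Draft],             # stands in for the LLM step
    render: Callable[[str], bytes],              # stands in for the T2I step
    verify: Callable[[bytes, str, str], bool],   # stands in for the VLM check
    max_attempts: int = 1000,
) -> List[Sample]:
    """Collect up to n_target verified samples for one task name."""
    kept: List[Sample] = []
    attempts = 0
    while len(kept) < n_target and attempts < max_attempts:
        attempts += 1
        draft = propose(task)
        image = render(draft.t2i_prompt)
        # Keep the triple only if the verifier accepts the image as
        # consistent with the question and answer.
        if verify(image, draft.question, draft.answer):
            kept.append(
                Sample(task, draft.question, draft.answer, draft.t2i_prompt, image)
            )
    return kept


def make_stubs():
    """Deterministic placeholders so the sketch runs without any model."""
    ids = count(1)
    checks = count(1)

    def propose(task: str) -> Draft:
        i = next(ids)
        return Draft(
            question=f"[{task}] Which object is closer to the camera, A or B?",
            answer="A",
            t2i_prompt=f"object A in front of object B, variant {i}",
        )

    def render(prompt: str) -> bytes:
        # Placeholder "image": just the encoded prompt.
        return prompt.encode("utf-8")

    def verify(image: bytes, question: str, answer: str) -> bool:
        # Placeholder filter: rejects every third candidate.
        return next(checks) % 3 != 0

    return propose, render, verify
```

Calling `build_dataset("Depth Order", 4, *make_stubs())` returns four samples after five attempts, because the stub verifier discards the third candidate. Passing the model calls in as callables keeps the loop independent of any particular LLM, T2I, or VLM provider, which the abstract does not specify.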
Problem

Research questions and friction points this paper is trying to address.

visual perception
vision-language models
synthetic supervision
spatial understanding
viewpoint recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation
vision-language models
visual perception
task-aware prompting
text-to-image synthesis