🤖 AI Summary
Current vision-language models (VLMs) perform poorly on visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. To address this, the work proposes VisionFoundry, an automated pipeline that constructs synthetic visual perception data using only a task name as input. The method uses large language models to generate questions, answers, and text-to-image prompts, synthesizes the corresponding images with text-to-image models, and verifies consistency with a proprietary VLM, all without reference images or human annotation. Models trained on the resulting VisionFoundry-10K dataset improve by +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases.
📝 Abstract
Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
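The generate, synthesize, and verify loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: `build_dataset`, `Sample`, the three callables, and the `max_attempts` cap are hypothetical names introduced here, and the toy components stand in for the LLM, the T2I model, and the verifier VLM, whose actual prompts and filtering criteria are not given in this summary.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
Proposer = Callable[[str], Dict[str, str]]    # task name -> question, answer, t2i_prompt
Renderer = Callable[[str], bytes]             # T2I prompt -> image bytes
Verifier = Callable[[bytes, str, str], bool]  # image, question, answer -> consistent?


@dataclass
class Sample:
    task: str
    question: str
    answer: str
    t2i_prompt: str
    image: bytes


def build_dataset(task: str, n_samples: int, propose: Proposer, render: Renderer,
                  verify: Verifier, max_attempts: Optional[int] = None) -> List[Sample]:
    """Collect up to n_samples verified triples for a single task name."""
    # The attempt cap is an assumption so that the loop always terminates.
    limit = max_attempts if max_attempts is not None else 10 * n_samples
    kept: List[Sample] = []
    for _ in range(limit):
        if len(kept) >= n_samples:
            break
        spec = propose(task)                 # LLM step: question, answer, T2I prompt
        image = render(spec["t2i_prompt"])   # T2I step: synthesize the image
        # Verification step: keep the triple only if image and answer agree.
        if verify(image, spec["question"], spec["answer"]):
            kept.append(Sample(task, spec["question"], spec["answer"],
                               spec["t2i_prompt"], image))
    return kept


def make_toy_components() -> Tuple[Proposer, Renderer, Verifier]:
    """Deterministic stand-ins so the sketch runs without any model."""
    counter = itertools.count()

    def propose(task: str) -> Dict[str, str]:
        i = next(counter)
        answer = "A" if i % 2 == 0 else "B"
        other = "B" if answer == "A" else "A"
        # Every third prompt depicts the wrong object, to exercise the filter.
        depicted = other if i % 3 == 2 else answer
        return {
            "question": f"[{task}] Which object is closer to the camera, A or B?",
            "answer": answer,
            "t2i_prompt": f"object {depicted} in front of the other, variant {i}",
        }

    def render(prompt: str) -> bytes:
        return prompt.encode("utf-8")  # placeholder "image": the prompt bytes

    def verify(image: bytes, question: str, answer: str) -> bool:
        return f"object {answer}".encode("utf-8") in image

    return propose, render, verify


if __name__ == "__main__":
    propose, render, verify = make_toy_components()
    for sample in build_dataset("Depth Order", 4, propose, render, verify):
        print(sample.answer, sample.t2i_prompt)
```

In a real setup, the three callables would wrap calls to the chosen LLM, T2I model, and VLM; keeping them as plain functions makes the filtering step easy to test in isolation.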