🤖 AI Summary
Current vision-language models (VLMs) rely heavily on linguistic priors and dataset biases, exhibiting limited capacity for genuine visual understanding. Method: We propose ViLP, a novel benchmark introducing a “three-images-three-answers” visual question answering (VQA) paradigm that systematically exposes VLMs’ visual reasoning deficiencies by contrasting text-answerable images with images requiring genuine visual inference. We further design a generative-model-based out-of-distribution (OOD) image construction method and an iterative self-improvement training framework that integrates pixel-level perturbations and semantic corruption. Contribution/Results: Evaluation on ViLP reveals a visual reasoning bottleneck in GPT-4, which achieves only 66.17% accuracy. Our framework significantly improves the visual reasoning performance of open-source VLMs, including LLaVA-v1.5 and Cambrian, demonstrating its effectiveness in enhancing robustness. ViLP establishes a new paradigm for rigorous VLM evaluation and training grounded in principled visual reasoning assessment.
📝 Abstract
Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual language priors present in their training data rather than performing true visual reasoning. To examine this, we introduce ViLP, a visual question answering (VQA) benchmark that pairs each question with three potential answers and three corresponding images: one image whose answer can be inferred from text alone, and two images that demand visual reasoning. By leveraging image generative models, we ensure significant variation in texture, shape, conceptual combinations, hallucinated elements, and proverb-based contexts, making our benchmark images distinctly out-of-distribution. While humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA pairs and images, then apply pixel-level and semantic corruptions to form “good-bad” image pairs for self-training. Our training objectives compel VLMs to focus more on actual visual inputs and have demonstrated their effectiveness in enhancing the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
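To make the “good-bad” pairing concrete, here is a minimal sketch of one way such pixel-level corruption could be implemented. The specific perturbations (Gaussian noise plus a box blur) and the function name are illustrative assumptions; the paper's exact corruption recipe is not reproduced here.

```python
import numpy as np

def pixel_corrupt(image: np.ndarray, noise_std: float = 25.0, seed: int = 0) -> np.ndarray:
    """Return a pixel-level corrupted copy of an HxWx3 uint8 image.

    Adds Gaussian noise followed by a 3x3 box blur -- one plausible reading
    of "pixel-level perturbations"; ViLP's actual corruptions may differ.
    """
    rng = np.random.default_rng(seed)
    img = image.astype(np.float32)
    # Gaussian pixel noise
    img = img + rng.normal(0.0, noise_std, size=img.shape)
    # 3x3 box blur via shifted averages (no external dependencies)
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    blurred = np.zeros_like(img)
    h, w = img.shape[:2]
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            blurred += padded[dy:dy + h, dx:dx + w]
    blurred /= 9.0
    return np.clip(blurred, 0, 255).astype(np.uint8)

# Form a "good-bad" pair from a clean image: the model is trained to
# answer correctly on `good` while the corrupted `bad` discourages
# answering from language priors alone.
good = np.full((8, 8, 3), 128, dtype=np.uint8)
bad = pixel_corrupt(good)
```

A semantic corruption step (e.g., swapping in a mismatched image) would complement this pixel-level one, but depends on the generative pipeline and is omitted here.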