JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

📅 2024-09-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models exhibit weak multimodal reasoning in non-photorealistic and counterfactual scenarios, relying unduly on linguistic priors and coarse-grained visual heuristics. Method: We introduce JourneyBench, a fine-grained vision-language understanding benchmark built from generated images, comprising five challenging tasks, including VQA with hallucination triggers and retrieval with sample-specific distractors, all requiring multi-step cross-modal reasoning. JourneyBench is constructed from fine-grained human annotations and controllable image synthesis, and incorporates a fine-grained, attribution-based evaluation framework. Contribution/Results: Extensive experiments reveal that state-of-the-art multimodal large models suffer substantial performance degradation across all tasks, exposing a critical gap between their superficial competence and genuine visual reasoning ability. This highlights fundamental limitations of prevailing paradigms in deep semantic alignment and counterfactual reasoning, underscoring the need for more robust, causally grounded vision-language modeling.

📝 Abstract
Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess models' fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models' visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
Problem

Research questions and friction points this paper is trying to address.

Visual-Linguistic Integration
Complex Scene Understanding
Text Bias Mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

JourneyBench
Multimodal Reasoning
Novel Benchmark
👥 Authors
Zhecan Wang, Columbia University
Junzhang Liu, Columbia University
Chia-Wei Tang, Virginia Tech
Hani Alomari, Virginia Tech
Anushka Sivakumar, Virginia Tech
Rui Sun, Columbia University
Wenhao Li, Columbia University
Md. Atabuzzaman, Virginia Tech
Hammad Ayyubi, Graduate Student, Columbia University (Artificial Intelligence, Machine Learning, Computer Vision, Natural Language Processing)
Haoxuan You, Apple AI/ML (Computer Vision, Deep Learning, NLP)
A. Ishmam, Virginia Tech
Kai-Wei Chang, UCLA
Shih-Fu Chang, Professor of Electrical Engineering and Computer Science, Columbia University (Multimedia, Computer Vision, Machine Learning, Signal Processing, Information Retrieval)
Chris Thomas, Virginia Tech (Computer Vision)