JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

📅 2024-09-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models exhibit weak multimodal reasoning in non-photorealistic and counterfactual scenarios, relying unduly on linguistic priors and coarse-grained visual heuristics. Method: We introduce JourneyBench, a fine-grained vision-language understanding benchmark built from generated images, comprising five challenging tasks, including VQA with hallucination triggers and retrieval with sample-specific distractors, all requiring multi-step cross-modal reasoning. JourneyBench is constructed from fine-grained human annotations and controllable image synthesis, and incorporates a fine-grained, attribution-based evaluation framework. Contribution/Results: Extensive experiments reveal that state-of-the-art multimodal large models suffer substantial performance degradation across all tasks, exposing a critical gap between their superficial competence and genuine visual reasoning ability. This highlights fundamental limitations of prevailing paradigms in deep semantic alignment and counterfactual reasoning, underscoring the need for more robust, causally grounded vision-language modeling.

📝 Abstract
Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess models' fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models' visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
Problem

Research questions and friction points this paper is trying to address.

Visual-Linguistic Integration
Complex Scene Understanding
Text Bias Mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

JourneyBench
Multimodal Reasoning
Novel Benchmark
👥 Authors
Zhecan Wang, Columbia University
Junzhang Liu, Columbia University
Chia-Wei Tang, Virginia Tech
Hani Alomari, Virginia Tech
Anushka Sivakumar, Virginia Tech
Rui Sun, Columbia University
Wenhao Li, Columbia University
Md. Atabuzzaman, Virginia Tech
Hammad Ayyubi, Graduate Student, Columbia University (Artificial Intelligence, Machine Learning, Computer Vision, Natural Language Processing)
Haoxuan You, Apple AI/ML (Computer Vision, Deep Learning, NLP)
A. Ishmam, Virginia Tech
Kai-Wei Chang, UCLA
Shih-Fu Chang, Professor of Electrical Engineering and Computer Science, Columbia University (Multimedia, Computer Vision, Machine Learning, Signal Processing, Information Retrieval)
Chris Thomas, Virginia Tech (Computer Vision)