🤖 AI Summary
Large vision-language models (LVLMs) show severe deficiencies in geometric visual perception, such as understanding shape, angle, and size, despite their strong performance on semantic and contextual tasks.
Method: We introduce VisOnlyQA, a manually constructed benchmark for purely visual geometric perception, comprising 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. It is used to rigorously evaluate the zero-shot and fine-tuned performance of 20 state-of-the-art LVLMs (e.g., GPT-4o, Gemini 1.5 Pro).
Contribution/Results: Our evaluation reveals a substantial human–model gap in geometric perception, with LVLMs performing far below near-perfect human baselines. Fine-tuning on VisOnlyQA's training set is not always effective, even for in-distribution tasks, so additional training data alone does not resolve the problem. Meanwhile, LVLMs built on stronger LLMs show better geometric perception even though the tasks require no complex reasoning, suggesting that the core bottleneck lies not in downstream reasoning but in how LVLMs process information from their visual encoders. By isolating and empirically characterizing these perception-level limitations, this work provides evidence to guide improvements at the vision encoder–language model interface.
📝 Abstract
Large Vision Language Models (LVLMs) have achieved remarkable performance in various vision-language tasks. However, it is still unclear how accurately LVLMs can perceive visual information in images. In particular, the capability of LVLMs to perceive geometric information, such as shape, angle, and size, remains insufficiently analyzed, although the perception of these properties is crucial for tasks that require a detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and reveal that LVLMs often cannot accurately perceive basic geometric information in images, while human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art LVLMs struggle with basic geometric perception -- 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on VisOnlyQA. (ii) Additional training data does not resolve this issue -- fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks. (iii) Bottleneck in the architecture -- LVLMs using stronger LLMs exhibit better geometric perception on VisOnlyQA, while it does not require complex reasoning, suggesting that the way LVLMs process information from visual encoders is a bottleneck. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
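Since VisOnlyQA poses multiple-choice questions about geometric information in images, evaluation reduces to exact-match accuracy over the chosen options, reported per task and overall. A minimal sketch of such scoring is below; the field names (`task`, `answer`, `prediction`) are illustrative placeholders, not the dataset's actual schema.

```python
# Hypothetical VisOnlyQA-style scoring: each example is a multiple-choice
# question about geometric information, and a model's extracted option letter
# is compared to the gold answer. Field names are illustrative assumptions.
from collections import defaultdict

def accuracy_by_task(examples):
    """Return (per-task accuracy dict, overall accuracy) via exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["task"]] += 1
        # Normalize whitespace/case before comparing option letters.
        if ex["prediction"].strip().lower() == ex["answer"].strip().lower():
            correct[ex["task"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_task, overall

# Toy predictions for two hypothetical tasks ("angle", "chart").
examples = [
    {"task": "angle", "answer": "B", "prediction": "B"},
    {"task": "angle", "answer": "C", "prediction": "A"},
    {"task": "chart", "answer": "A", "prediction": "A"},
]
per_task, overall = accuracy_by_task(examples)
```

Per-task breakdowns matter here because aggregate accuracy can mask the geometry-specific failures the paper highlights.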