🤖 AI Summary
Large vision-language models (LVLMs) show severe deficiencies in geometric visual perception, such as understanding shape, angle, and size, despite their strong performance on semantic and contextual tasks.
Method: We introduce VisOnlyQA, a manually constructed benchmark for purely visual geometric perception, comprising 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. It is used to rigorously evaluate the zero-shot and fine-tuned performance of 20 state-of-the-art LVLMs (e.g., GPT-4o, Gemini 1.5 Pro).
Contribution/Results: Our evaluation reveals a substantial human–model gap in geometric perception, with LVLMs performing far below near-perfect human baselines. Fine-tuning on VisOnlyQA's training set is not always effective, even for in-distribution tasks, so additional training data alone does not resolve the problem. Meanwhile, LVLMs built on stronger LLMs show better geometric perception even though the tasks require no complex reasoning, suggesting that the core bottleneck lies not in downstream reasoning but in how LVLMs process information from their visual encoders. By isolating and empirically characterizing these perception-level limitations, this work provides evidence to guide improvements at the vision encoder–language model interface.
📝 Abstract
Large Vision Language Models (LVLMs) have achieved remarkable performance in various vision-language tasks. However, it is still unclear how accurately LVLMs can perceive visual information in images. In particular, the capability of LVLMs to perceive geometric information, such as shape, angle, and size, remains insufficiently analyzed, although the perception of these properties is crucial for tasks that require a detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and reveal that LVLMs often cannot accurately perceive basic geometric information in images, while human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art LVLMs struggle with basic geometric perception -- 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on VisOnlyQA. (ii) Additional training data does not resolve this issue -- fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks. (iii) Bottleneck in the architecture -- LVLMs using stronger LLMs exhibit better geometric perception on VisOnlyQA, while it does not require complex reasoning, suggesting that the way LVLMs process information from visual encoders is a bottleneck. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
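Since VisOnlyQA poses multiple-choice questions about geometric information in images, evaluation reduces to exact-match accuracy over the chosen options, reported per task and overall. A minimal sketch of such scoring is below; the field names (`task`, `answer`, `prediction`) are illustrative placeholders, not the dataset's actual schema.

```python
# Hypothetical VisOnlyQA-style scoring: each example is a multiple-choice
# question about geometric information, and a model's extracted option letter
# is compared to the gold answer. Field names are illustrative assumptions.
from collections import defaultdict

def accuracy_by_task(examples):
    """Return (per-task accuracy dict, overall accuracy) via exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["task"]] += 1
        # Normalize whitespace/case before comparing option letters.
        if ex["prediction"].strip().lower() == ex["answer"].strip().lower():
            correct[ex["task"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_task, overall

# Toy predictions for two hypothetical tasks ("angle", "chart").
examples = [
    {"task": "angle", "answer": "B", "prediction": "B"},
    {"task": "angle", "answer": "C", "prediction": "A"},
    {"task": "chart", "answer": "A", "prediction": "A"},
]
per_task, overall = accuracy_by_task(examples)
```

Per-task breakdowns matter here because aggregate accuracy can mask the geometry-specific failures the paper highlights.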