Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning

📅 2024-12-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) show limited performance on fundamental visual graph understanding and structured reasoning tasks. Method: The authors introduce VGCure, a comprehensive benchmark of 22 fine-grained tasks for visual graph understanding, and propose a structure-aware self-supervised fine-tuning framework. The framework explicitly models nodes, edges, and hierarchical relationships through three complementary objectives (graph topology reconstruction, relational contrastive learning, and structure-aware prompting), combining multimodal vision-language modeling with multi-granularity evaluation. Contribution/Results: Experiments show an average 18.7% improvement on VGCure and a 12.3% gain on downstream graph-related tasks, substantially strengthening LVLMs' structural awareness and robustness on complex visual graphs.
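The three self-supervised objectives are named only at a high level in the summary. As an illustration, the first two can be sketched in NumPy below; all function names, shapes, and loss forms are assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

def topology_reconstruction_loss(node_emb, adj_true, eps=1e-9):
    """Sketch of a graph-topology-reconstruction objective: predict the
    graph's adjacency matrix from node embeddings via pairwise dot products."""
    logits = node_emb @ node_emb.T            # (N, N) edge scores
    probs = 1.0 / (1.0 + np.exp(-logits))     # sigmoid -> edge probabilities
    # binary cross-entropy against the ground-truth adjacency matrix
    return float(-np.mean(adj_true * np.log(probs + eps)
                          + (1 - adj_true) * np.log(1 - probs + eps)))

def relational_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Sketch of a relational contrastive (InfoNCE-style) objective: pull an
    anchor node's embedding toward a related node and away from unrelated ones."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = np.array([cos(anchor, positive)] +
                    [cos(anchor, n) for n in negatives]) / temperature
    # negative log-softmax probability of the positive pair (index 0)
    return float(np.log(np.sum(np.exp(sims))) - sims[0])
```

The third objective, structure-aware prompting, presumably acts on the input side (encoding structural cues into the prompt) rather than as a loss term, so it is omitted from this sketch.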

📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through three self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs' performance on fundamental and downstream graph learning tasks, as well as enhancing their robustness against complex visual graphs.
Problem

Research questions and friction points this paper is trying to address.

Enhance LVLMs' graph understanding
Address LVLMs' visual graph limitations
Improve LVLMs' reasoning on complex graphs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops VGCure benchmark for LVLMs
Introduces structure-aware fine-tuning framework
Enhances LVLMs with self-supervised tasks