🤖 AI Summary
Existing graph language model (GLM) evaluation benchmarks predominantly repurpose unimodal node classification datasets and therefore fail to rigorously assess joint graph-language reasoning; empirical results show that strong performance is achievable with text-only prompts, revealing little need for multimodal fusion. Method: We introduce CLEGR, the first synthetic benchmark designed for structure-semantics co-reasoning, featuring controllable graph generation and multi-level question answering to systematically evaluate GLMs' multimodal reasoning capabilities. Contribution/Results: Experiments reveal significant performance degradation of mainstream GLMs on structural reasoning tasks; notably, vanilla large language models (LLMs) match or surpass graph-augmented models, challenging the efficacy of current graph neural network-LLM integration paradigms. CLEGR establishes a rigorous standard and diagnostic framework for GLM evaluation.
📝 Abstract
Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR (Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation on tasks that require structural reasoning. These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.