🤖 AI Summary
Existing graph language model (GLM) evaluation benchmarks predominantly repurpose unimodal node classification datasets and therefore fail to rigorously assess joint graph-language reasoning; empirical results show that strong performance is achievable with text-only prompts, revealing little need for multimodal fusion. Method: We introduce CLEGR, the first synthetic benchmark designed for structure-semantics co-reasoning, featuring controllable graph generation and multi-level question answering to systematically evaluate GLMs' multimodal reasoning capabilities. Contribution/Results: Experiments reveal significant performance degradation of mainstream GLMs on structural reasoning tasks; notably, vanilla large language models (LLMs) match or surpass graph-augmented models, challenging the efficacy of current graph neural network-LLM integration paradigms. CLEGR establishes a rigorous standard and diagnostic framework for GLM evaluation.
📝 Abstract
Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR (Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation on tasks that require structural reasoning. These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.