Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Prior work lacks a systematic evaluation of how large language models (LLMs) reason over text-rich graph data, for example node classification in fraud detection and recommendation systems. Method: This paper presents a large-scale, controlled benchmark of three LLM-graph interaction modes (prompting, tool use, and code generation), evaluated across application domains, graph structural properties (homophily versus heterophily, node degree), feature characteristics (short versus long node text), and model size and reasoning capability, together with ablations that truncate features, delete edges, and remove labels. Contribution/Results: Code generation achieves the strongest overall performance, with especially large gains on graphs with long node text or high degree, where prompting quickly exceeds the token budget. All three modes remain effective on heterophilic graphs. Code generation also shifts its reliance among structure, features, and labels toward whichever input is most informative. The study characterizes the strengths and limitations of each mode and offers design guidance for LLM-based graph reasoning systems.
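To make the homophily versus heterophily distinction concrete, here is a minimal sketch of the edge homophily ratio, a commonly used measure defined as the fraction of edges whose endpoints share a label. The summary does not say which homophily measure the paper uses, so this choice, the function name `edge_homophily`, and the input layout are assumptions for illustration only.

```python
from typing import Hashable, Iterable, Mapping, Tuple


def edge_homophily(
    edges: Iterable[Tuple[Hashable, Hashable]],
    labels: Mapping[Hashable, Hashable],
) -> float:
    """Return the fraction of edges whose two endpoints share a label.

    Edges with an unlabeled endpoint are skipped. Values near 1 indicate a
    homophilic graph; values near 0 indicate a heterophilic one.
    (Hypothetical helper; not taken from the paper.)
    """
    same = 0
    total = 0
    for u, v in edges:
        if u not in labels or v not in labels:
            continue  # cannot compare labels for this edge
        total += 1
        if labels[u] == labels[v]:
            same += 1
    if total == 0:
        raise ValueError("no edge has labels on both endpoints")
    return same / total


if __name__ == "__main__":
    toy_edges = [(0, 1), (1, 2), (2, 3)]
    toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
    # Two of the three edges connect same-label nodes.
    print(edge_homophily(toy_edges, toy_labels))
```

A graph where most neighbors disagree on the label would score close to 0 under this measure, which is the regime the summary calls heterophilic.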

📝 Abstract

Large language models (LLMs) are increasingly used for text-rich graph machine learning tasks such as node classification in high-impact domains like fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in their interaction with graph data. In this work, we conduct a large-scale, controlled evaluation across several key axes of variability to systematically assess the strengths and weaknesses of LLM-based graph reasoning methods in text-based applications. The axes include the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; structural regimes contrasting homophilic and heterophilic graphs; feature characteristics involving both short- and long-text node attributes; and model configurations with varying LLM sizes and reasoning capabilities. We further analyze dependencies by methodically truncating features, deleting edges, and removing labels to quantify reliance on input types. Our findings provide practical and actionable guidance. (1) LLMs as code generators achieve the strongest overall performance on graph data, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation is able to flexibly adapt its reliance between structure, features, or labels to leverage the most informative input type. Together, these findings provide a comprehensive view of the strengths and limitations of current LLM-graph interaction modes and highlight key design principles for future approaches.
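The dependency analysis described in the abstract (truncating features, deleting edges, removing labels) could be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the `TextGraph` container, the function names, character-level truncation, and uniform random sampling are all assumptions, since the abstract does not specify how each ablation is carried out.

```python
import random
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TextGraph:
    """Hypothetical container for a text-attributed graph."""
    texts: Dict[int, str]              # node id -> node text
    edges: List[Tuple[int, int]]       # undirected edges
    labels: Dict[int, Optional[str]]   # node id -> label, None if hidden


def _count(fraction: float, n: int) -> int:
    """Number of items corresponding to `fraction` of `n`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return round(fraction * n)


def truncate_features(g: TextGraph, max_chars: int) -> TextGraph:
    """Keep only the first `max_chars` characters of every node text."""
    return replace(g, texts={n: t[:max_chars] for n, t in g.texts.items()})


def delete_edges(g: TextGraph, fraction: float, seed: int = 0) -> TextGraph:
    """Drop a uniformly random `fraction` of edges, keeping original order."""
    rng = random.Random(seed)
    keep = len(g.edges) - _count(fraction, len(g.edges))
    kept_idx = sorted(rng.sample(range(len(g.edges)), keep))
    return replace(g, edges=[g.edges[i] for i in kept_idx])


def remove_labels(g: TextGraph, fraction: float, seed: int = 0) -> TextGraph:
    """Hide a uniformly random `fraction` of the currently known labels."""
    rng = random.Random(seed)
    labeled = [n for n, y in g.labels.items() if y is not None]
    hidden = set(rng.sample(labeled, _count(fraction, len(labeled))))
    return replace(
        g,
        labels={n: (None if n in hidden else y) for n, y in g.labels.items()},
    )
```

Each function returns a new graph and leaves the input untouched, so one base dataset can be swept over several ablation levels and the accuracy drop compared per input type.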

Problem

Research questions and friction points this paper is trying to address.

Systematically evaluating LLM capabilities for graph reasoning tasks across multiple variability axes

Assessing LLM-graph interaction modes including prompting, tool-use and code generation

Quantifying LLM reliance on different input types like structure, features and labels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Code generation achieves strongest graph inference performance

All interaction strategies remain effective on heterophilic graphs

Code generation flexibly adapts reliance between input types
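A back-of-the-envelope sketch can illustrate why prompting runs into the token budget on long-text or high-degree graphs, as the abstract reports. It assumes that prompting serializes the text of every node within a fixed number of hops into the context, that neighborhoods do not overlap, and that one token is roughly four characters. All three are simplifying assumptions for illustration, not details from the paper, and the function name is hypothetical.

```python
def estimate_prompt_tokens(
    avg_text_chars: int,
    degree: int,
    hops: int,
    chars_per_token: int = 4,  # rough rule of thumb, an assumption
) -> int:
    """Estimate tokens needed to serialize a node's `hops`-hop neighborhood.

    Assumes a tree-like neighborhood in which every node has `degree`
    previously unseen neighbors, which gives an upper-bound style estimate.
    """
    if min(avg_text_chars, degree, hops) < 0 or chars_per_token <= 0:
        raise ValueError("arguments must be non-negative, chars_per_token > 0")
    # Target node plus degree**h nodes at each hop h = 1..hops.
    num_nodes = sum(degree ** h for h in range(hops + 1))
    return num_nodes * avg_text_chars // chars_per_token


if __name__ == "__main__":
    # Short texts, modest degree versus long texts, high degree.
    print(estimate_prompt_tokens(avg_text_chars=400, degree=10, hops=2))
    print(estimate_prompt_tokens(avg_text_chars=2000, degree=50, hops=2))
```

Under these assumptions the estimate grows linearly with text length but polynomially with degree. Generated code, by contrast, can process the graph outside the context window, which is consistent with the reported gains for code generation on such graphs.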

🔎 Similar Papers

LLM-Enhanced User-Item Interactions: Leveraging Edge Information for Optimized Recommendations