🤖 AI Summary
This paper addresses the lack of systematic evaluation of large language models' (LLMs) graph reasoning capabilities by introducing GraphOmni, the first comprehensive benchmark for this task. Methodologically, it establishes a multidimensional evaluation framework covering diverse graph types, serialization schemes (e.g., DFS- and BFS-ordered representations, adjacency lists), and prompting strategies; it further proposes a PPO-based reinforcement learning approach that adaptively pairs serialization formats with prompts. Key contributions include: (1) the first fine-grained, reproducible assessment of LLMs' graph reasoning; (2) empirical identification of critical limitations in structural awareness, long-range dependency modeling, and format robustness; and (3) an open-source, modular, and extensible evaluation infrastructure, together with the adaptive pairing strategy that substantially improves accuracy on graph tasks—laying the groundwork for general-purpose graph reasoning models.
📝 Abstract
In this paper, we present GraphOmni, a comprehensive benchmark framework for systematically evaluating the graph reasoning capabilities of LLMs. By analyzing critical dimensions, including graph types, serialization formats, and prompt schemes, we provide extensive insights into the strengths and limitations of current LLMs. Our empirical findings show that no single serialization or prompting strategy consistently outperforms the others. Motivated by these insights, we propose a reinforcement learning-based approach that dynamically selects the best serialization-prompt pairing, resulting in significant accuracy improvements. GraphOmni's modular and extensible design establishes a robust foundation for future research, facilitating advancements toward general-purpose graph reasoning models.
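To make the "serialization format" dimension concrete, the sketch below renders the same small directed graph in three textual formats of the kind the benchmark compares (an edge list, an adjacency list, and a BFS visitation order). This is an illustrative example of ours, not code or naming from GraphOmni itself:

```python
from collections import deque

# A tiny directed graph given as (source, target) pairs.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

def as_edge_list(edges):
    # One edge per line, e.g. "0 -> 1".
    return "\n".join(f"{u} -> {v}" for u, v in edges)

def build_adjacency(edges):
    # Map each node to the sorted list of its out-neighbors.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    return {u: sorted(vs) for u, vs in adj.items()}

def as_adjacency_list(edges):
    # One node per line, e.g. "0: [1, 2]".
    adj = build_adjacency(edges)
    return "\n".join(f"{u}: {vs}" for u, vs in sorted(adj.items()))

def as_bfs_order(edges, start=0):
    # Nodes listed in breadth-first visitation order from `start`.
    adj = build_adjacency(edges)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return " ".join(map(str, order))
```

The point of varying these formats is that an identical graph question can yield different model accuracy depending purely on which of these strings appears in the prompt, which is what motivates the adaptive pairing approach.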