🤖 AI Summary
Existing benchmarks cannot pinpoint where large language models (LLMs) begin to fail at generating graph-structured data as structural complexity increases. This work introduces GraphInstruct, the first hierarchy-based graph generation benchmark, which spans six levels of structural complexity and five evaluation dimensions. It comprises 800 handcrafted instructions and 1,582 algorithmically generated reference graphs, and is used to assess 12 LLMs under diverse prompting strategies. The study finds that performance bottlenecks stem primarily from the composition of multiple constraints rather than from reasoning depth alone. Building on the benchmark, the authors propose a verification-guided iterative framework that consistently outperforms existing prompt-engineering techniques on the tested models. They also find that domain-specific semantic constraints do not improve under iterative refinement, which points to retrieval-augmented approaches as a promising direction for future research.
📝 Abstract
Graph-structured data underpins applications from citation analysis and social-network modeling to molecular design and knowledge-graph construction, and Large Language Models (LLMs) are increasingly used as prompt-driven graph synthesizers. Classical graph-generation reviews catalog deep generative models and their evaluation primitives, but predate the LLM era and provide no foundation for evaluating instruction-following graph synthesis. Recent LLM-era benchmarks evaluate models along graph-type or task-domain axes; such organizations, however, average over structural complexity and cannot localize where in the complexity spectrum an LLM breaks down. To close this diagnostic gap, we introduce GraphInstruct, a progressive-complexity benchmark that stratifies LLM graph generation into six complexity levels and five evaluation dimensions, paired with 800 hand-authored instructions, 1,582 algorithmically synthesized reference solutions, and a 12-LLM capability evaluation across 45 (model, strategy) configurations. We find that discriminative power peaks at multi-constraint composition rather than reasoning depth, that no single prompting strategy dominates across levels or model families, and that domain-semantic constraints remain iteration-invariant under all tested methods, pointing to retrieval rather than additional compute as the next research frontier. Atop the benchmark, a verification-guided iterative framework with constraint-aware adaptive prompting consistently surpasses the prompt-engineering ceiling on tested target models, demonstrating that the benchmark's fine-grained signals drive method development.
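The abstract does not describe the internals of the verification-guided iterative framework, so the following is only a minimal sketch of the general idea under stated assumptions: a generator proposes an edge list, programmatic checks report which constraints are violated, and the violations are fed back into the next prompt. The names `refine`, `has_edge_count`, `is_connected`, and `fake_llm` are hypothetical, the generator is a stand-in rather than a real model call, and the constraint-aware adaptive prompting mentioned in the abstract is not modeled here.

```python
from typing import Callable, List, Optional, Tuple

Edge = Tuple[int, int]
# A check returns a human-readable violation message, or None if satisfied.
Check = Callable[[int, List[Edge]], Optional[str]]


def has_edge_count(m: int) -> Check:
    """Hypothetical constraint: the graph has exactly m distinct undirected edges."""
    def check(n: int, edges: List[Edge]) -> Optional[str]:
        k = len({frozenset(e) for e in edges})
        return None if k == m else f"expected {m} distinct edges, got {k}"
    return check


def is_connected() -> Check:
    """Hypothetical constraint: all n nodes (labeled 0..n-1) are reachable from node 0."""
    def check(n: int, edges: List[Edge]) -> Optional[str]:
        if n == 0:
            return None
        adj = {v: set() for v in range(n)}
        for u, v in edges:
            if u in adj and v in adj:  # ignore edges with out-of-range endpoints
                adj[u].add(v)
                adj[v].add(u)
        seen, stack = {0}, [0]
        while stack:
            for w in adj[stack.pop()] - seen:
                seen.add(w)
                stack.append(w)
        if len(seen) == n:
            return None
        return f"graph is disconnected: {len(seen)} of {n} nodes reachable from node 0"
    return check


def refine(generate: Callable[[str], List[Edge]], instruction: str, n: int,
           checks: List[Check], max_rounds: int = 3):
    """Generate, verify, and re-prompt with violations until all checks pass."""
    prompt, edges, violations = instruction, [], []
    for round_no in range(1, max_rounds + 1):
        edges = generate(prompt)
        violations = []
        for c in checks:
            msg = c(n, edges)
            if msg is not None:
                violations.append(msg)
        if not violations:
            return edges, round_no, []
        # Feed the verifier's findings back into the next prompt.
        feedback = "\n".join(f"- {v}" for v in violations)
        prompt = f"{instruction}\nFix these violations:\n{feedback}"
    return edges, max_rounds, violations


def fake_llm(prompt: str) -> List[Edge]:
    """Stand-in for a model call (assumption): wrong first, corrected after feedback."""
    if "violations" in prompt:
        return [(0, 1), (1, 2), (2, 3)]
    return [(0, 1), (2, 3), (0, 1)]
```

In this sketch each check is independent, so the feedback lists every violated constraint at once; that is one simple way to expose multi-constraint failures to the generator, not necessarily the design the authors use.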