LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models' (LLMs) capabilities in logical planning and complex relational reasoning remains challenging due to the lack of dedicated, controllable benchmarks. Method: We introduce LogiPlan, a benchmark built on dynamically generated graph-structured tasks with tunable complexity, controlled by the number of objects, the number of relations, and the minimum depth of relational chains. It employs a three-task evaluation framework: Plan Generation, Consistency Detection, and Comparison Question, augmented by a self-correction assessment in which models are prompted to verify and refine their initial solutions. Contribution/Results: Comprehensive evaluation across state-of-the-art models, including GPT-4.5, Llama 3.1 405B, and Claude 3.7 Sonnet, reveals significant performance gaps that correlate with model scale and architecture: recent reasoning-enhanced models handle simpler instances but struggle with configurations requiring deeper logical planning.

📝 Abstract
We introduce LogiPlan, a novel benchmark designed to evaluate the capabilities of large language models (LLMs) in logical planning and reasoning over complex relational structures. Logical relational reasoning is important for applications that may rely on LLMs to generate and query structured graphs of relations such as network infrastructure, knowledge bases, or business process schema. Our framework allows for dynamic variation of task complexity by controlling the number of objects, relations, and the minimum depth of relational chains, providing a fine-grained assessment of model performance across difficulty levels. LogiPlan encompasses three complementary tasks: (1) Plan Generation, where models must construct valid directed relational graphs meeting specified structural constraints; (2) Consistency Detection, testing models' ability to identify inconsistencies in relational structures; and (3) Comparison Question, evaluating models' capacity to determine the validity of queried relationships within a given graph. Additionally, we assess models' self-correction capabilities by prompting them to verify and refine their initial solutions. We evaluate state-of-the-art models including DeepSeek R1, Gemini 2.0 Pro, Gemini 2 Flash Thinking, GPT-4.5, GPT-4o, Llama 3.1 405B, O3-mini, O1, and Claude 3.7 Sonnet across these tasks, revealing significant performance gaps that correlate with model scale and architecture. Our analysis demonstrates that while recent reasoning-enhanced models show promising results on simpler instances, they struggle with more complex configurations requiring deeper logical planning.
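The abstract describes dynamically generated tasks whose difficulty is controlled by the number of objects, the number of relations, and the minimum depth of relational chains. A minimal sketch of how such a generator might look (hypothetical illustration, not the paper's code; all names are assumptions):

```python
import random

def generate_relation_graph(num_objects, num_relations, min_depth, seed=0):
    """Generate a consistent (acyclic) directed relation graph.

    Parameters mirror the complexity controls described in the abstract:
    number of objects, number of relations, and a guaranteed minimum
    relational-chain depth.
    """
    rng = random.Random(seed)
    objects = [f"obj{i}" for i in range(num_objects)]
    edges = set()
    # Guarantee one relational chain of at least `min_depth` edges,
    # following index order so the graph stays acyclic.
    chain = sorted(rng.sample(range(num_objects), min_depth + 1))
    for i, j in zip(chain, chain[1:]):
        edges.add((objects[i], objects[j]))
    # Pad with extra edges that also respect index order (still acyclic).
    while len(edges) < num_relations:
        i, j = sorted(rng.sample(range(num_objects), 2))
        edges.add((objects[i], objects[j]))
    return objects, sorted(edges)
```

Because every edge points from a lower to a higher index, the generated graph is guaranteed consistent; harder instances follow from raising the three parameters.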
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' logical planning and relational reasoning abilities
Assessing model performance across varying task complexity levels
Testing consistency detection and relationship validation in structured graphs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic task complexity control via the number of objects, relations, and minimum relational-chain depth
Three tasks: Plan Generation, Consistency Detection, Comparison Question
Self-correction assessment through solution verification and refinement
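The Consistency Detection task listed above can be grounded in a standard cycle check: a set of asymmetric directed relations is inconsistent exactly when the relation graph contains a cycle. A minimal sketch using Kahn's topological-sort algorithm (an assumed implementation for illustration, not the paper's code):

```python
from collections import defaultdict, deque

def is_consistent(edges):
    """Return True iff the directed relations contain no cycle (Kahn's algorithm)."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
        nodes.update((a, b))
    # Start from nodes with no incoming relations.
    queue = deque(n for n in nodes if indeg[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    # If every node was ordered, no cycle exists.
    return visited == len(nodes)
```

For example, the relations A→B, B→C, C→A form a cycle and would be flagged as inconsistent.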
Yanan Cai
Microsoft Azure
Ahmed Salem
Microsoft
Besmira Nushi
Microsoft Research
Machine Learning, Responsible AI
M. Russinovich
Microsoft Azure