PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles

📅 2025-10-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Evaluating the reasoning and planning capabilities of foundation models in complex, dynamic environments remains challenging due to the lack of comprehensive, fine-grained benchmarks. Method: We propose PuzzlePlex, a benchmark framework of 15 diverse puzzle types spanning deterministic and stochastic dynamics as well as single-player and two-player games, paired with scalable, fine-grained evaluation metrics. PuzzlePlex systematically compares instruction-based and code-based execution paradigms, integrating customized game-playing strategies and a closed loop of code generation and execution. Contribution/Results: Experiments show that reasoning-specialized models excel under instruction-based execution, whereas code-based execution, though more demanding, scales better and runs more efficiently across puzzle types. PuzzlePlex sets a standard for assessing model capabilities in dynamic environments and points a clear path for capability-oriented evaluation and future model improvement.
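
The "closed loop of code generation and execution" can be read as a generate, run, and repair cycle: the model writes a solver program, the framework executes it, and any failure is fed back for a fix. Below is a minimal sketch under that reading; `code_execution_loop`, the `model.generate` call, and the repair budget are illustrative assumptions, not the paper's implementation.

```python
import subprocess
import tempfile

MAX_REPAIR_ROUNDS = 3  # assumed budget; the paper does not state one

def code_execution_loop(model, puzzle_spec: str):
    """Hypothetical code-based execution loop: the model writes a solver
    program, the framework runs it, and failures are fed back for repair."""
    prompt = f"Write a Python program that solves this puzzle:\n{puzzle_spec}"
    for _ in range(MAX_REPAIR_ROUNDS):
        program = model.generate(prompt)  # assumed model-call interface
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run(
                ["python", path], capture_output=True, text=True, timeout=60
            )
        except subprocess.TimeoutExpired:
            prompt += "\n\nYour program timed out; make it faster."
            continue
        if result.returncode == 0:
            return result.stdout  # candidate solution, scored by the benchmark
        # Close the loop: feed the error back and ask for a repaired program.
        prompt += f"\n\nYour program failed with:\n{result.stderr}\nPlease fix it."
    return None  # no runnable program within the repair budget
```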

📝 Abstract
This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.
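
The abstract describes a comprehensive environment for each of the 15 games. A uniform interface covering deterministic/stochastic dynamics and single-/two-player play might look like the following sketch; `PuzzleEnv`, `StepResult`, and all method names are assumptions for illustration, not the framework's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

@dataclass
class StepResult:
    observation: Any        # next state shown to the acting player
    reward: float           # fine-grained score signal for this move
    done: bool              # whether the episode has ended
    current_player: int     # 0 for single-player; 0 or 1 in two-player games

class PuzzleEnv(ABC):
    """Hypothetical shared interface for PuzzlePlex-style games, covering
    deterministic/stochastic dynamics and single-/two-player settings."""

    num_players: int = 1
    stochastic: bool = False

    @abstractmethod
    def reset(self, difficulty: int = 0) -> Any:
        """Start a new instance; higher difficulty generates harder ones."""

    @abstractmethod
    def legal_moves(self) -> list[Any]:
        """Moves available to the current player."""

    @abstractmethod
    def step(self, move: Any) -> StepResult:
        """Apply a move; stochastic games also sample chance outcomes here."""
```

A shared interface like this is what makes the abstract's extensibility claim plausible: new puzzles and harder instances plug in without changing the evaluation harness.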
Problem

Research questions and friction points this paper is trying to address.

Assessing reasoning and planning capabilities of foundation models
Evaluating model scalability in complex dynamic environments
Benchmarking performance across diverse puzzle types and settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces PuzzlePlex benchmark for reasoning evaluation
Implements fine-grained metrics for performance analysis
Compares instruction-based and code-based execution methods (a scoring sketch follows below)
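
A minimal sketch of how fine-grained metrics could be aggregated to compare the two execution paradigms side by side; the metric and field names (`win_rate`, `invalid_move_rate`, `latency_s`) are illustrative, not the paper's definitions.

```python
from statistics import mean

def score_runs(episodes: list[dict]) -> dict:
    """Aggregate per-episode logs into fine-grained metrics.
    Each episode dict is assumed to carry 'won', 'moves',
    'invalid_moves', and 'latency_s'; the field names are illustrative."""
    return {
        "win_rate": mean(e["won"] for e in episodes),
        "invalid_move_rate": mean(
            e["invalid_moves"] / max(e["moves"], 1) for e in episodes
        ),
        "mean_latency_s": mean(e["latency_s"] for e in episodes),
    }

# Side-by-side comparison of the two paradigms on the same puzzle set.
instruction_runs = [{"won": 1, "moves": 20, "invalid_moves": 1, "latency_s": 4.2}]
code_runs = [{"won": 1, "moves": 20, "invalid_moves": 0, "latency_s": 0.3}]
print("instruction-based:", score_runs(instruction_runs))
print("code-based:      ", score_runs(code_runs))
```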