AI Summary
Evaluating the reasoning and planning capabilities of foundation models in complex, dynamic environments remains challenging due to the lack of comprehensive, fine-grained benchmarks.
Method: We propose PuzzlePlex, a benchmark framework comprising 15 diverse puzzle categories (spanning deterministic and stochastic dynamics, and single- and two-player games) together with scalable, fine-grained evaluation metrics. PuzzlePlex systematically compares instruction-based and code-based execution paradigms, integrating customized game-playing strategies, instruction tuning, and a code generation-execution closed loop.
Contribution/Results: Experiments reveal that reasoning-specialized models excel in instruction-based execution, whereas code-based execution, though more demanding, offers significantly better scalability and execution efficiency across puzzle types. PuzzlePlex establishes a new standard for assessing model capabilities in dynamic environments and provides a clear technical pathway for architecture evolution and capability-oriented evaluation.
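To make the code-based setting concrete, here is a minimal, hypothetical sketch of a generation-execution closed loop: the model emits a game-playing policy as code, a harness executes it against a puzzle environment, and illegal moves or errors become feedback. All class, function, and variable names below are illustrative assumptions, not PuzzlePlex's actual API, and the toy Nim game stands in for the benchmark's puzzles.

```python
# Hypothetical harness for a code generation-execution closed loop.
# None of these names come from PuzzlePlex itself.

class Nim:
    """Toy deterministic puzzle: remove 1-3 stones; taking the last stone wins."""
    def __init__(self, stones=10):
        self.stones = stones

    def legal_moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones]

    def apply(self, move):
        self.stones -= move

    def done(self):
        return self.stones == 0

# Stand-in for model-generated code; in the real setting this string
# would come from the foundation model under evaluation.
GENERATED_POLICY = """
def policy(stones):
    # Leave a multiple of 4 for the opponent when possible.
    move = stones % 4
    return move if move in (1, 2, 3) else 1
"""

def run_episode(policy_src, game):
    """Execute generated code against the environment, reporting success."""
    namespace = {}
    exec(policy_src, namespace)      # compile and load the generated policy
    policy = namespace["policy"]
    steps = 0
    while not game.done():
        move = policy(game.stones)
        if move not in game.legal_moves():
            # In a closed loop, this failure would be fed back to the model.
            return {"ok": False, "steps": steps}
        game.apply(move)
        steps += 1
    return {"ok": True, "steps": steps}

result = run_episode(GENERATED_POLICY, Nim(10))
```

The point of the sketch is the control flow, not the game: the model pays a one-time generation cost, after which the compiled policy runs without further model calls, which is why code-based execution can scale more efficiently than issuing an instruction-based query per move.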
Abstract
This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.