🤖 AI Summary
Large language models (LLMs) exhibit weak reasoning capabilities on search-based logical problems: tasks that are intuitive for humans but challenging for LLMs because they require backtracking and exploration of multiple solution paths.
Method: We introduce SearchBench, a benchmark that supports automatic instance generation and solution-quality verification for such problems. We propose a two-stage, multi-attempt prompting paradigm: (1) augmenting in-context learning with A* algorithm demonstrations, and (2) generating and validating executable code via unit-test-driven synthesis.
Contribution/Results: Our approach significantly enhances LLMs' ability to model search logic. On SearchBench, it boosts GPT-4's end-to-end solving rate from 1.4% to 57.1%, substantially outperforming baseline methods, including pure code-generation approaches. SearchBench establishes a novel evaluation standard for complex reasoning in LLMs, while our prompting paradigm offers an effective framework for reasoning enhancement.
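The paper's A* demonstrations are not reproduced here; as a rough illustration of what an in-context A* example might contain, the sketch below implements a generic A* search. The `neighbors`/`heuristic` callback interface is our assumption for illustration, not the paper's actual prompt format:

```python
import heapq
import itertools

def astar(start, goal, neighbors, heuristic):
    """Generic A* search returning a lowest-cost path from start to goal.

    neighbors(state) yields (next_state, step_cost) pairs;
    heuristic(state) must never overestimate the remaining cost (admissible).
    """
    counter = itertools.count()  # tie-breaker so raw states are never compared
    frontier = [(heuristic(start), next(counter), 0, start, [start])]
    best_g = {start: 0}  # cheapest known cost to reach each state
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                priority = new_g + heuristic(nxt)
                heapq.heappush(
                    frontier, (priority, next(counter), new_g, nxt, path + [nxt])
                )
    return None  # no path exists
```

For example, on a grid puzzle `neighbors` would yield the legal moves from a cell and `heuristic` could be the Manhattan distance to the goal.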
📝 Abstract
Recently, Large Language Models (LLMs) have attained impressive performance on math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique search problem types, each equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that even the most advanced LLMs fail to solve these problems end-to-end in text, e.g., GPT-4 solves only 1.4%. SearchBench problems require considering multiple pathways to the solution as well as backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4's performance rises to 11.7%. In this work, we show that in-context learning with A* algorithm implementations enhances performance. The full potential of this prompting approach emerges when combined with our proposed Multi-Stage-Multi-Try method, which breaks down the algorithm implementation into two stages and verifies the first stage against unit tests, raising GPT-4's performance above 57%.
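The verify-and-retry idea behind Multi-Stage-Multi-Try can be sketched as a small loop: request a candidate implementation, run it against unit tests, and only keep a candidate that passes. Here `generate_candidate` is a hypothetical stand-in for the LLM call; this is an illustration of the pattern, not the paper's implementation:

```python
def multi_try(generate_candidate, unit_tests, max_tries=5):
    """Return the first candidate function that passes every unit test.

    generate_candidate(attempt) -> callable  (e.g., code produced by an LLM)
    unit_tests: list of (input, expected_output) pairs
    Returns None if no candidate passes within max_tries attempts.
    """
    for attempt in range(max_tries):
        func = generate_candidate(attempt)  # hypothetical LLM code-generation call
        try:
            if all(func(inp) == expected for inp, expected in unit_tests):
                return func  # verified: safe to use in the next stage
        except Exception:
            pass  # a crashing candidate simply fails verification
    return None
```

Gating the first stage on unit tests this way means later stages only ever build on code whose basic behavior has been machine-checked.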