🤖 AI Summary
Large language models (LLMs) exhibit weak reasoning capabilities on search-based logical problems: tasks that are intuitive for humans but challenging for LLMs because they require backtracking and exploration of multiple solution paths.
Method: We introduce SearchBench, a benchmark that supports automatic instance generation and solution-quality verification for such problems. We propose a two-stage, multi-attempt prompting paradigm: (1) augmenting in-context learning with A* algorithm demonstrations, and (2) generating and validating executable code via unit-test-driven synthesis.
Contribution/Results: Our approach significantly enhances LLMs' ability to model search logic. On SearchBench, it boosts GPT-4's end-to-end solving rate from 1.4% to 57.1%, substantially outperforming baseline methods, including pure code-generation approaches. SearchBench establishes a novel evaluation standard for complex reasoning in LLMs, while our prompting paradigm offers an effective framework for reasoning enhancement.
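The paper's A* demonstrations are not reproduced here; as a rough illustration of what an in-context A* example might contain, the sketch below implements a generic A* search. The `neighbors`/`heuristic` callback interface is our assumption for illustration, not the paper's actual prompt format:

```python
import heapq
import itertools

def astar(start, goal, neighbors, heuristic):
    """Generic A* search returning a lowest-cost path from start to goal.

    neighbors(state) yields (next_state, step_cost) pairs;
    heuristic(state) must never overestimate the remaining cost (admissible).
    """
    counter = itertools.count()  # tie-breaker so raw states are never compared
    frontier = [(heuristic(start), next(counter), 0, start, [start])]
    best_g = {start: 0}  # cheapest known cost to reach each state
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                priority = new_g + heuristic(nxt)
                heapq.heappush(
                    frontier, (priority, next(counter), new_g, nxt, path + [nxt])
                )
    return None  # no path exists
```

For example, on a grid puzzle `neighbors` would yield the legal moves from a cell and `heuristic` could be the Manhattan distance to the goal.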
📝 Abstract
Recently, Large Language Models (LLMs) have attained impressive performance on math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique search problem types, each equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that even the most advanced LLMs fail to solve these problems end-to-end in text, e.g., GPT-4 solves only 1.4%. SearchBench problems require considering multiple pathways to the solution as well as backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4's performance rises to 11.7%. In this work, we show that in-context learning with A* algorithm implementations enhances performance. The full potential of this prompting approach emerges when combined with our proposed Multi-Stage-Multi-Try method, which breaks down the algorithm implementation into two stages and verifies the first stage against unit tests, raising GPT-4's performance above 57%.
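The verify-and-retry idea behind Multi-Stage-Multi-Try can be sketched as a small loop: request a candidate implementation, run it against unit tests, and only keep a candidate that passes. Here `generate_candidate` is a hypothetical stand-in for the LLM call; this is an illustration of the pattern, not the paper's implementation:

```python
def multi_try(generate_candidate, unit_tests, max_tries=5):
    """Return the first candidate function that passes every unit test.

    generate_candidate(attempt) -> callable  (e.g., code produced by an LLM)
    unit_tests: list of (input, expected_output) pairs
    Returns None if no candidate passes within max_tries attempts.
    """
    for attempt in range(max_tries):
        func = generate_candidate(attempt)  # hypothetical LLM code-generation call
        try:
            if all(func(inp) == expected for inp, expected in unit_tests):
                return func  # verified: safe to use in the next stage
        except Exception:
            pass  # a crashing candidate simply fails verification
    return None
```

Gating the first stage on unit tests this way means later stages only ever build on code whose basic behavior has been machine-checked.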