🤖 AI Summary
This study investigates the impact of inference-time scaling techniques on large language models’ (LLMs) ability to solve complex tasks and examines the associated computational–performance trade-offs. To this end, we introduce Sys2Bench—a novel, open-source, and reproducible benchmark covering five task categories: arithmetic reasoning, logical reasoning, commonsense reasoning, algorithmic reasoning, and planning. We systematically evaluate mainstream inference-time methods—including multi-step chain-of-thought, self-verification, beam search, Tree-of-Thought (ToT), and self-consistency ensembling—across 11 diverse tasks. Results reveal that no single technique generalizes across all task types, challenging the implicit assumption that “more computation yields better performance.” Instead, we observe pronounced diminishing returns and strong task-specific bottlenecks. As a standardized, publicly available evaluation framework, Sys2Bench has already been adopted by multiple follow-up studies, establishing a unified foundation for rigorous assessment of inference-time reasoning capabilities.
📝 Abstract
We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification. Here, we explore how scaling inference-time techniques can improve reasoning and planning, focusing on the trade-off between computational cost and performance. To this end, we construct a comprehensive benchmark, called Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories: arithmetic reasoning, logical reasoning, commonsense reasoning, algorithmic reasoning, and planning. Our findings indicate that simply scaling inference-time computation has limitations, as no single inference-time technique consistently performs well across all reasoning and planning tasks.
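To make the cost–performance trade-off concrete, consider self-consistency ensembling, one of the techniques evaluated: sample several chain-of-thought completions independently and take a majority vote over their final answers. The sketch below is illustrative only, not code from the paper; the `sample_answer` callable is a hypothetical stand-in for a single LLM completion call.

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=10):
    """Majority vote over independently sampled final answers.

    `sample_answer` is a zero-argument callable that runs one
    chain-of-thought completion and returns its final answer
    (a hypothetical placeholder for an LLM API call).
    """
    answers = [sample_answer() for _ in range(n_samples)]
    # Counter.most_common(1) returns [(answer, count)] for the modal answer.
    return Counter(answers).most_common(1)[0][0]
```

Note that the number of model calls, and hence the inference cost, grows linearly with `n_samples`, while accuracy gains typically flatten out; this diminishing return is exactly the kind of scaling behavior Sys2Bench is designed to measure.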