Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the impact of inference-time scaling techniques on large language models’ (LLMs) ability to solve complex tasks and examines the associated computational–performance trade-offs. To this end, we introduce Sys2Bench—a novel, open-source, and reproducible benchmark covering five task categories: arithmetic, logic, commonsense reasoning, algorithms, and planning. We systematically evaluate mainstream inference-time methods—including multi-step chain-of-thought, self-verification, beam search, Tree-of-Thought (ToT), and self-consistency ensembling—across 11 diverse tasks. Results reveal that no single technique generalizes across all task types, challenging the implicit assumption that “more computation yields better performance.” Instead, we observe pronounced diminishing returns and strong task-specific bottlenecks. As a standardized, publicly available evaluation framework, Sys2Bench has already been adopted by multiple follow-up studies, establishing a unified foundation for rigorous assessment of inference-time reasoning capabilities.

📝 Abstract
We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification. Here, we explore how scaling inference-time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance. To this end, we construct a comprehensive benchmark, known as Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories, including arithmetic reasoning, logical reasoning, common sense reasoning, algorithmic reasoning, and planning. Our findings indicate that simply scaling inference-time computation has limitations, as no single inference-time technique consistently performs well across all reasoning and planning tasks.
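Among the techniques the benchmark evaluates, self-consistency ensembling is the simplest to illustrate: sample several independent reasoning chains and majority-vote the final answers. The sketch below is illustrative only, not from the paper; `generate` and `fake_generate` are hypothetical stand-ins for a sampling-based LLM call.

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=5):
    """Sample n answers and return the majority vote with its agreement rate.

    `generate` is a hypothetical stand-in for any stochastic LLM call that
    returns a final answer string per invocation.
    """
    answers = [generate(prompt) for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples

# Toy usage: a deterministic stub standing in for a noisy model.
_stub_answers = iter(["42", "41", "42", "42", "41"])
def fake_generate(prompt):
    return next(_stub_answers)

answer, agreement = self_consistency(fake_generate, "What is 6*7?", n_samples=5)
# The majority answer wins even though two of five samples disagree.
```

The same scaffold generalizes to the other ensembling variants in the benchmark by swapping the vote for a verifier or a tree-structured search over intermediate steps.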
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM reasoning without additional training.
Understand the tradeoff between computational cost and performance.
Evaluate inference-time techniques across diverse tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time techniques enhance reasoning
Multi-step reasoning improves LLM performance
Sys2Bench evaluates diverse reasoning tasks
Shubham Parashar
Graduate Student at Texas A&M University
Natural Language Processing · Computer Vision · Machine Learning
Blake Olson
Department of Computer Science & Engineering, Texas A&M University
Sambhav Khurana
Department of Computer Science & Engineering, Texas A&M University
Eric Li
Department of Computer Science & Engineering, Texas A&M University
Hongyi Ling
Texas A&M University
Graph Neural Networks · Trustworthy AI
James Caverlee
Professor, Computer Science and Engineering, Texas A&M University
Recommender systems · Information retrieval · Data mining · Social media · Data-intensive systems
Shuiwang Ji
Department of Computer Science & Engineering, Texas A&M University