🤖 AI Summary
This study investigates the impact of inference-time scaling techniques on large language models’ (LLMs) ability to solve complex tasks and examines the associated computational–performance trade-offs. To this end, we introduce Sys2Bench—a novel, open-source, and reproducible benchmark covering five task categories: arithmetic reasoning, logical reasoning, commonsense reasoning, algorithmic reasoning, and planning. We systematically evaluate mainstream inference-time methods—including multi-step chain-of-thought, self-verification, beam search, Tree-of-Thought (ToT), and self-consistency ensembling—across 11 diverse tasks. Results reveal that no single technique generalizes across all task types, challenging the implicit assumption that “more computation yields better performance.” Instead, we observe pronounced diminishing returns and strong task-specific bottlenecks. As a standardized, publicly available evaluation framework, Sys2Bench has already been adopted by multiple follow-up studies, establishing a unified foundation for rigorous assessment of inference-time reasoning capabilities.
📝 Abstract
We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification. Here, we explore how scaling inference-time techniques can improve reasoning and planning, focusing on the trade-off between computational cost and performance. To this end, we construct a comprehensive benchmark, called Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories: arithmetic reasoning, logical reasoning, commonsense reasoning, algorithmic reasoning, and planning. Our findings indicate that simply scaling inference-time computation has limitations, as no single inference-time technique consistently performs well across all reasoning and planning tasks.
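To make the cost–performance trade-off concrete, consider self-consistency ensembling, one of the techniques evaluated: sample several chain-of-thought completions independently and take a majority vote over their final answers. The sketch below is illustrative only, not code from the paper; the `sample_answer` callable is a hypothetical stand-in for a single LLM completion call.

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=10):
    """Majority vote over independently sampled final answers.

    `sample_answer` is a zero-argument callable that runs one
    chain-of-thought completion and returns its final answer
    (a hypothetical placeholder for an LLM API call).
    """
    answers = [sample_answer() for _ in range(n_samples)]
    # Counter.most_common(1) returns [(answer, count)] for the modal answer.
    return Counter(answers).most_common(1)[0][0]
```

Note that the number of model calls, and hence the inference cost, grows linearly with `n_samples`, while accuracy gains typically flatten out; this diminishing return is exactly the kind of scaling behavior Sys2Bench is designed to measure.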