🤖 AI Summary
Existing mathematical reasoning benchmarks (e.g., GSM8K) suffer from fixed difficulty levels and high manual construction costs, limiting their ability to discriminate fine-grained differences in advanced LLMs’ reasoning capabilities.
Method: We propose an automated chain-of-problems generation framework that integrates forward chaining, backward chaining, and stochastic branching to enable controllable complexity escalation and large-scale synthesis of high-difficulty problems.
Contribution/Results: Our method overcomes the bottlenecks of manual benchmark construction and yields GSM8K-Scheherazade, a benchmark that enables fine-grained assessment of reasoning abilities. Experiments show that while mainstream LLMs' accuracy drops precipitously after only a few chained problems, o1-preview's performance persists, and it is the only evaluated model that performs better at backward reasoning than forward reasoning. This work establishes a scalable, interpretable paradigm for evaluating LLM reasoning capabilities.
📝 Abstract
Benchmarks are critical for measuring Large Language Model (LLM) reasoning capabilities; some have even become de facto indicators of such capabilities. However, as LLM reasoning improves, existing widely used benchmarks such as GSM8K capture only marginal differences between models: most state-of-the-art models, for example, achieve over 94% accuracy on GSM8K (paperwithcode, 2024). While constructing harder benchmarks is possible, their creation is often manual, expensive, and unscalable. We therefore present Scheherazade, an automated approach that produces large quantities of challenging mathematical reasoning benchmarks by logically chaining a small starting set of problems. We propose two chaining methods, forward chaining and backward chaining, both of which use randomized branching to generate complex reasoning problems. We apply Scheherazade to GSM8K to create GSM8K-Scheherazade and evaluate three frontier LLMs and OpenAI's o1-preview on it. While the other frontier models' performance declines precipitously after only a few chained questions, our evaluation suggests o1-preview's performance persists; the flagship OpenAI model is also the only one to perform better at backward reasoning. Our data and code are available at https://github.com/YoshikiTakashima/scheherazade-code-data.
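To make the chaining idea concrete, here is a minimal sketch of forward chaining: solve one word problem and feed its answer in as a quantity of the next, so the combined problem can only be answered by reasoning through the whole chain. The `Problem` class, the toy problem templates, and the `forward_chain` helper are all invented for illustration; they are not the paper's actual implementation, which also supports backward chaining (phrasing an earlier quantity in terms of a later problem's answer) and randomized branching.

```python
# Illustrative sketch (assumed, not the paper's code): forward-chaining two
# GSM8K-style problems by threading each answer into the next statement.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Problem:
    text: str                        # statement with an {x} slot for the chained-in quantity
    solve: Callable[[float], float]  # computes this problem's answer from that quantity

# Two toy problems; names and templates are hypothetical.
p1 = Problem("Ali has 4 bags with {x} apples each. How many apples in total?",
             lambda x: 4 * x)
p2 = Problem("Bea starts with {x} coins and spends half. How many coins remain?",
             lambda x: x / 2)

def forward_chain(problems: List[Problem], seed_value: float) -> Tuple[str, float]:
    """Solve problems left to right, substituting each answer into the next text."""
    value, parts = seed_value, []
    for p in problems:
        parts.append(p.text.format(x=value))
        value = p.solve(value)
    return " ".join(parts), value

statement, answer = forward_chain([p1, p2], seed_value=3)
# p1 yields 4 * 3 = 12; p2 then yields 12 / 2 = 6.
```

Backward chaining would instead generate the statements in reverse dependency order, so that resolving the first quantity requires first solving a later sub-problem; the sketch above only shows the forward direction.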