Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical reasoning benchmarks (e.g., GSM8K) suffer from fixed difficulty levels and high manual construction costs, limiting their ability to discriminate fine-grained differences in advanced LLMs' reasoning capabilities. Method: The authors propose an automated chain-of-problems generation framework that combines forward chaining, backward chaining, and randomized branching to enable controllable complexity escalation and large-scale synthesis of high-difficulty problems. Contribution/Results: The method sidesteps the bottlenecks of manual benchmark construction and yields GSM8K-Scheherazade, a benchmark that supports fine-grained assessment of reasoning ability. Experiments show that o1-preview is the only evaluated model whose performance persists as chains grow and improves on backward reasoning, while other frontier LLMs degrade precipitously after only a few chained problems. This work offers a scalable, interpretable paradigm for evaluating LLM reasoning capabilities.

📝 Abstract
Benchmarks are critical for measuring Large Language Model (LLM) reasoning capabilities. Some benchmarks have even become the de facto indicator of such capabilities. However, as LLM reasoning capabilities improve, existing widely-used benchmarks such as GSM8K marginally encapsulate model reasoning differentials: most state-of-the-art models, for example, achieve over 94% accuracy on the GSM8K dataset (Papers with Code, 2024). While constructing harder benchmarks is possible, their creation is often manual, expensive, and unscalable. As such, we present Scheherazade, an automated approach to produce large quantities of challenging mathematical reasoning benchmarks by logically chaining a small starting set of problems. We propose two different chaining methods, forward chaining and backward chaining, which include randomized branching techniques to generate complex reasoning problems. We apply Scheherazade on GSM8K to create GSM8K-Scheherazade and evaluate 3 frontier LLMs and OpenAI's o1-preview on it. We show that while other frontier models' performance declines precipitously at only a few questions chained, our evaluation suggests o1-preview's performance persists, with the flagship OpenAI model the only one to perform better at backward reasoning. Our data and code are available at https://github.com/YoshikiTakashima/scheherazade-code-data.
Problem

Research questions and friction points this paper is trying to address.

Automated generation of challenging math benchmarks
Evaluating LLM reasoning with Chain-of-Problems
Performance comparison of frontier LLMs on novel benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated benchmark generation
Forward and backward chaining
Randomized branching techniques
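The chaining idea can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' implementation: `Problem`, `forward_chain`, and `backward_chain` are assumed names, and the conditional wording is invented; the actual framework (see the linked repository) generates richer branches.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    text: str    # natural-language problem statement
    answer: int  # ground-truth numeric answer

def forward_chain(problems: list[Problem]) -> str:
    """Forward chaining: each problem is gated on the answer to the
    *previous* one, so a solver must work through the chain front to back.
    The condition is always true under the ground truth, so the chained
    problem stays solvable with a known answer."""
    parts = [problems[0].text]
    for prev, cur in zip(problems, problems[1:]):
        parts.append(
            f"If the answer to the previous question is greater than "
            f"{prev.answer - 1}, then answer this: {cur.text} "
            f"Otherwise, answer 0."
        )
    return " ".join(parts)

def backward_chain(problems: list[Problem]) -> str:
    """Backward chaining: each problem is gated on the answer to the
    *next* one, forcing the solver to reason about later questions first."""
    parts = []
    for cur, nxt in zip(problems, problems[1:]):
        parts.append(
            f"If the answer to the next question is greater than "
            f"{nxt.answer - 1}, then answer this: {cur.text} "
            f"Otherwise, answer 0."
        )
    parts.append(problems[-1].text)
    return " ".join(parts)
```

Randomized branching (not shown) would vary which branch carries the real sub-problem while keeping the ground-truth answer well defined.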
Stephen Miner
Yale University
Yoshiki Takashima
Yale University
Simeng Han
Yale University
Ferhat Erata
Yale University
Neuro-Symbolic AI · Automated Reasoning · Alignment · Security & Privacy
Timos Antonopoulos
Research Scientist, Yale University
R. Piskac
Yale University
Scott J. Shapiro
Yale University