🤖 AI Summary
Existing mathematical reasoning benchmarks (e.g., GSM8K) suffer from fixed difficulty levels and high manual construction costs, limiting their ability to discriminate fine-grained differences in advanced LLMs’ reasoning capabilities.
Method: We propose an automated chain-of-problems generation framework that integrates forward chaining, backward chaining, and stochastic branching to enable controllable complexity escalation and large-scale synthesis of high-difficulty problems.
Contribution/Results: Our method overcomes the bottlenecks of manual benchmark construction and yields GSM8K-Scheherazade, a benchmark that enables fine-grained assessment of reasoning abilities. Experiments show that while mainstream LLMs' accuracy drops precipitously after only a few chained problems, o1-preview's performance persists, and it is the only evaluated model that performs better at backward reasoning than forward reasoning. This work establishes a scalable, interpretable paradigm for evaluating LLM reasoning capabilities.
📝 Abstract
Benchmarks are critical for measuring Large Language Model (LLM) reasoning capabilities; some have even become de facto indicators of such capabilities. However, as LLM reasoning improves, existing widely used benchmarks such as GSM8K capture only marginal differences between models: most state-of-the-art models, for example, achieve over 94% accuracy on GSM8K (paperwithcode, 2024). While constructing harder benchmarks is possible, their creation is often manual, expensive, and unscalable. We therefore present Scheherazade, an automated approach that produces large quantities of challenging mathematical reasoning benchmarks by logically chaining a small starting set of problems. We propose two chaining methods, forward chaining and backward chaining, both of which use randomized branching to generate complex reasoning problems. We apply Scheherazade to GSM8K to create GSM8K-Scheherazade and evaluate three frontier LLMs and OpenAI's o1-preview on it. While the other frontier models' performance declines precipitously after only a few chained questions, our evaluation suggests o1-preview's performance persists; the flagship OpenAI model is also the only one to perform better at backward reasoning. Our data and code are available at https://github.com/YoshikiTakashima/scheherazade-code-data.
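To make the chaining idea concrete, here is a minimal sketch of forward chaining: solve one word problem and feed its answer in as a quantity of the next, so the combined problem can only be answered by reasoning through the whole chain. The `Problem` class, the toy problem templates, and the `forward_chain` helper are all invented for illustration; they are not the paper's actual implementation, which also supports backward chaining (phrasing an earlier quantity in terms of a later problem's answer) and randomized branching.

```python
# Illustrative sketch (assumed, not the paper's code): forward-chaining two
# GSM8K-style problems by threading each answer into the next statement.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Problem:
    text: str                        # statement with an {x} slot for the chained-in quantity
    solve: Callable[[float], float]  # computes this problem's answer from that quantity

# Two toy problems; names and templates are hypothetical.
p1 = Problem("Ali has 4 bags with {x} apples each. How many apples in total?",
             lambda x: 4 * x)
p2 = Problem("Bea starts with {x} coins and spends half. How many coins remain?",
             lambda x: x / 2)

def forward_chain(problems: List[Problem], seed_value: float) -> Tuple[str, float]:
    """Solve problems left to right, substituting each answer into the next text."""
    value, parts = seed_value, []
    for p in problems:
        parts.append(p.text.format(x=value))
        value = p.solve(value)
    return " ".join(parts), value

statement, answer = forward_chain([p1, p2], seed_value=3)
# p1 yields 4 * 3 = 12; p2 then yields 12 / 2 = 6.
```

Backward chaining would instead generate the statements in reverse dependency order, so that resolving the first quantity requires first solving a later sub-problem; the sketch above only shows the forward direction.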