ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the redundant search space and inefficient reasoning paths that large language models (LLMs) face in complex mathematical reasoning. It proposes a hierarchical template scaling framework with three components: (1) a library of around 500 generalizable, high-level structured thought templates; (2) hierarchical reinforcement learning over sequences of templates, rather than long chains of thought (CoT), to plan template trajectories; and (3) an inference-time system that adaptively scales thought templates. The resulting ReasonFlux-32B model reaches 91.2% accuracy on the MATH benchmark, surpassing o1-preview by 6.7%, and solves an average of 56.7% of AIME problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. These results indicate substantial gains in mathematical reasoning accuracy.
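
To make the idea of a thought template library concrete, here is a minimal sketch of how such a library could be stored and queried. It is not the ReasonFlux implementation: the names `ThoughtTemplate`, `TemplateLibrary`, and `retrieve`, the tag-based representation, and the Jaccard-overlap scoring are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class ThoughtTemplate:
    """Hypothetical record for one high-level reasoning template."""
    name: str
    tags: FrozenSet[str]        # topics the template applies to (assumed field)
    steps: Tuple[str, ...]      # abstract reasoning steps, not a worked solution


class TemplateLibrary:
    """Toy in-memory library; retrieval by tag overlap is an assumption."""

    def __init__(self) -> None:
        self._templates: List[ThoughtTemplate] = []

    def add(self, template: ThoughtTemplate) -> None:
        self._templates.append(template)

    def retrieve(self, query_tags: FrozenSet[str], k: int = 1) -> List[ThoughtTemplate]:
        """Return up to k templates ranked by Jaccard overlap with query_tags."""
        def score(t: ThoughtTemplate) -> float:
            union = t.tags | query_tags
            return len(t.tags & query_tags) / len(union) if union else 0.0

        # Drop templates with no overlap, then rank the rest (stable sort).
        matches = [t for t in self._templates if score(t) > 0]
        return sorted(matches, key=score, reverse=True)[:k]


library = TemplateLibrary()
library.add(ThoughtTemplate(
    name="trig_substitution",
    tags=frozenset({"trigonometry", "substitution", "radical"}),
    steps=("spot a radical of the form sqrt(a^2 - x^2)",
           "substitute x = a*sin(t)",
           "simplify and solve in t",
           "map the result back to x"),
))
library.add(ThoughtTemplate(
    name="quadratic_factoring",
    tags=frozenset({"algebra", "quadratic", "factoring"}),
    steps=("write the equation in standard form",
           "factor the quadratic",
           "read the roots off the factors"),
))

best = library.retrieve(frozenset({"radical", "substitution"}), k=1)
```

Storing abstract steps rather than worked solutions is what would let one template apply to many similar problems; a real system would likely retrieve by learned embeddings instead of hand-written tags.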

📝 Abstract
We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs such as OpenAI o1-preview and DeepSeek-V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library containing around 500 high-level thought templates that generalize to similar or related reasoning problems; (ii) hierarchical reinforcement learning on sequences of thought templates instead of long CoTs, which optimizes a base LLM to plan an optimal template trajectory for handling complex problems step by step; (iii) a new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the American Invitational Mathematics Examination (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux
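
The interplay between planning a template trajectory and scaling it at inference time can be sketched as a small control loop. This is an illustrative sketch under stated assumptions, not the authors' system: the function `solve_with_template_scaling`, the `plan`/`execute`/`is_solved` callables, and the growing `budgets` schedule are hypothetical, and the stubs stand in for what would be LLM calls.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical interfaces (assumptions, not the ReasonFlux API):
#   plan(problem, budget)              -> ordered list of template names
#   execute(problem, template, state)  -> new state, or None if the step fails
#   is_solved(problem, state)          -> True once the state holds an answer


def solve_with_template_scaling(
    problem: str,
    plan: Callable[[str, int], List[str]],
    execute: Callable[[str, str, str], Optional[str]],
    is_solved: Callable[[str, str], bool],
    budgets: Sequence[int] = (1, 2, 4),
) -> Optional[Dict[str, object]]:
    """Try progressively longer template trajectories until one solves the problem."""
    for budget in budgets:
        trajectory = plan(problem, budget)   # high level: choose the templates
        state = ""
        failed = False
        for template in trajectory:          # low level: apply them in order
            result = execute(problem, template, state)
            if result is None:
                failed = True
                break
            state = result
        if not failed and is_solved(problem, state):
            return {"answer": state, "trajectory": trajectory, "budget": budget}
    return None                              # no budget was sufficient


# Deterministic stubs so the control flow can be run without a model.
FULL_PLAN = ["identify_quadratic", "factor", "read_roots"]


def toy_plan(problem: str, budget: int) -> List[str]:
    return FULL_PLAN[:budget]


def toy_execute(problem: str, template: str, state: str) -> Optional[str]:
    return f"{state} -> {template}" if state else template


def toy_is_solved(problem: str, state: str) -> bool:
    return state.endswith("read_roots")


outcome = solve_with_template_scaling(
    "x^2 - 5x + 6 = 0", toy_plan, toy_execute, toy_is_solved
)
```

With these stubs, budgets of 1 and 2 yield incomplete trajectories, so the loop only succeeds once the budget admits all three templates. Searching over short sequences of templates instead of token-level chains of thought is the sense in which the abstract describes the search space as optimized.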
Problem

Research questions and friction points this paper is trying to address.

Optimize the reasoning search space through hierarchical reasoning over thought templates
Improve math reasoning beyond powerful LLMs such as OpenAI o1-preview and DeepSeek-V3
Introduce an inference system that adaptively scales thought templates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical LLM reasoning
Thought template library
Inference scaling system