🤖 AI Summary
This work addresses the problems of redundant search space and inefficient reasoning paths in large language models (LLMs) for complex mathematical reasoning. Methodologically, it proposes a Hierarchical Template Scaling framework comprising: (1) a library of over 500 generalizable, high-level structured reasoning templates; (2) hierarchical reinforcement learning applied over template sequences to optimize chain-of-thought (CoT) trajectory planning; and (3) an inference-time adaptive template scaling mechanism that dynamically adjusts template granularity and invocation depth. The resulting ReasonFlux-32B model achieves 91.2% accuracy on the MATH benchmark—surpassing o1-preview by 6.7 percentage points—and solves 56.7% of AIME problems, outperforming o1-preview and DeepSeek-V3 by 27.0 and 45.0 percentage points, respectively. These results demonstrate substantial improvements in both accuracy and generalization for mathematical reasoning.
📝 Abstract
We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and surpass the mathematical reasoning capabilities of powerful LLMs such as OpenAI o1-preview and DeepSeek-V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured, generic thought template library containing around 500 high-level thought templates that generalize to similar or related reasoning problems; (ii) hierarchical reinforcement learning over sequences of thought templates instead of long CoTs, optimizing a base LLM to plan an optimal template trajectory for gradually handling complex problems; (iii) a new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. By composing sequential thought templates into a template trajectory, ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark it achieves an accuracy of 91.2%, surpassing o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux
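To make the template-trajectory idea concrete, the following is a minimal, hypothetical sketch of the inference-time flow described above: retrieve candidate templates for a problem, plan a trajectory over them, and instantiate each template into concrete reasoning steps. All names (`Template`, `retrieve`, `plan_trajectory`, `solve`) and the two-entry toy library are illustrative assumptions, not the paper's actual API; the real system uses an RL-trained planner and a library of ~500 templates.

```python
from dataclasses import dataclass, field

@dataclass
class Template:
    """One high-level, reusable reasoning template (illustrative)."""
    name: str
    problem_type: str
    steps: list = field(default_factory=list)  # abstract reasoning steps

# Hypothetical miniature template library (the paper's holds ~500 entries).
LIBRARY = [
    Template("complete_the_square", "quadratic",
             ["rewrite ax^2+bx+c as a*(x + b/(2a))^2 + (c - b^2/(4a))",
              "solve the squared form for x"]),
    Template("telescoping_sum", "series",
             ["rewrite each term as a difference f(k) - f(k+1)",
              "cancel interior terms and evaluate the boundary terms"]),
]

def retrieve(problem_type, library):
    """Select templates matching the problem type; a stand-in for the
    learned template-selection policy."""
    return [t for t in library if t.problem_type == problem_type]

def plan_trajectory(problem_type, library, max_depth=3):
    """Greedily assemble a template trajectory. The real system plans
    this sequence with hierarchical RL and adaptively scales template
    granularity; here we just take the first match per subproblem."""
    trajectory = []
    for _ in range(max_depth):
        candidates = retrieve(problem_type, library)
        if not candidates:
            break
        trajectory.append(candidates[0])
        break  # this toy problem decomposes into a single subproblem
    return trajectory

def solve(problem_type, library):
    """Instantiate each template in the planned trajectory into a flat
    list of concrete reasoning steps."""
    steps = []
    for template in plan_trajectory(problem_type, library):
        steps.extend(template.steps)
    return steps
```

For example, `solve("quadratic", LIBRARY)` expands the completing-the-square template into its two steps, while an unknown problem type yields an empty trajectory. The point of the sketch is the separation of concerns: a template library, a trajectory planner over it, and step-level instantiation, which mirrors the paper's three components.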