Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
The empirical benefits of curriculum learning in the post-training of large language models (LLMs) lack principled theoretical justification. Method: We propose curriculum strategies that incrementally increase reasoning-chain depth or progressively shorten hint prefixes, and introduce a state-conditioned autoregressive reasoning-tree model. This framework enables curriculum-aware fine-tuning via reinforcement learning under outcome-only reward signals. Contribution: We provide the first theoretical proof that curriculum learning can overcome the exponential sample-complexity barrier inherent in tree-structured reasoning, reducing it to polynomial order, and establish polynomial-cost scaling guarantees for test-time inference. Experiments demonstrate substantial improvements in reasoning accuracy, alongside significant reductions in sampling overhead and API query costs.

📝 Abstract
Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from Chain-of-Thought (CoT) solutions to mathematical problems such as Countdown and parity, we model CoT generation as a state-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement-learning fine-tuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
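The exponential-vs-polynomial gap claimed above can be illustrated with a toy simulation (a hypothetical sketch, not the paper's actual model or proof): on a uniform-branching reasoning tree with branching factor `b` and depth `d`, a sampler guided only by outcome reward succeeds with probability `b**-d` per attempt, so direct search costs on the order of `b**d` samples, whereas a depth-increasing curriculum that fixes one solved level per stage costs roughly `d * b`.

```python
import random

def direct_samples(b, d, rng):
    """Sample full depth-d paths uniformly until one hits the target leaf.
    Expected cost: b**d samples (exponential in depth)."""
    target = [0] * d  # w.l.o.g. the correct branch is 0 at every level
    samples = 0
    while True:
        samples += 1
        path = [rng.randrange(b) for _ in range(d)]
        if path == target:
            return samples

def curriculum_samples(b, d, rng):
    """Depth-increasing curriculum: at stage k the first k-1 levels are
    already learned, so only one new branch is searched per stage.
    Expected cost: d * b samples (polynomial in depth)."""
    samples = 0
    for _ in range(d):
        samples += 1
        while rng.randrange(b) != 0:  # retry until the correct branch is hit
            samples += 1
    return samples

rng = random.Random(0)
b, d, trials = 3, 8, 200
direct = sum(direct_samples(b, d, rng) for _ in range(trials)) / trials
curric = sum(curriculum_samples(b, d, rng) for _ in range(trials)) / trials
print(f"direct search:  ~{direct:.0f} samples (theory: b**d = {b ** d})")
print(f"curriculum:     ~{curric:.0f} samples (theory: d*b  = {d * b})")
```

Even at this small scale (`b = 3`, `d = 8`), the averages track the theoretical costs: thousands of samples for direct search versus a few dozen for the staged curriculum, mirroring the exponential-to-polynomial reduction the paper proves formally.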
Problem

Research questions and friction points this paper is trying to address.

Understanding why curriculum learning outperforms direct training in reasoning tasks
Modeling Chain-of-Thought generation as autoregressive reasoning trees for mathematical problems
Establishing theoretical guarantees for curriculum learning's polynomial complexity benefits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum post-training avoids exponential complexity bottleneck
Reinforcement-learning fine-tuning achieves polynomial sample complexity
Curriculum-aware querying reduces sampling cost to polynomial order