Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
The empirical benefits of curriculum learning in the post-training of large language models (LLMs) lack principled theoretical justification. Method: We propose curriculum strategies that incrementally increase reasoning-chain depth or progressively shorten hint prefixes, and introduce a state-conditioned autoregressive reasoning-tree model. This framework enables curriculum-aware fine-tuning via reinforcement learning under outcome-only reward signals. Contribution: We provide the first theoretical proof that curriculum learning can overcome the exponential sample-complexity barrier inherent in tree-structured reasoning, reducing it to polynomial order, and establish polynomial-cost scaling guarantees for test-time inference. Experiments demonstrate substantial improvements in reasoning accuracy, alongside significant reductions in sampling overhead and API query costs.

📝 Abstract
Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from Chain-of-Thought (CoT) solutions to mathematical problems such as Countdown and parity, we model CoT generation as a state-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement-learning fine-tuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
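The exponential-vs-polynomial gap claimed above can be illustrated with a toy simulation (a hypothetical sketch, not the paper's actual model or proof): on a uniform-branching reasoning tree with branching factor `b` and depth `d`, a sampler guided only by outcome reward succeeds with probability `b**-d` per attempt, so direct search costs on the order of `b**d` samples, whereas a depth-increasing curriculum that fixes one solved level per stage costs roughly `d * b`.

```python
import random

def direct_samples(b, d, rng):
    """Sample full depth-d paths uniformly until one hits the target leaf.
    Expected cost: b**d samples (exponential in depth)."""
    target = [0] * d  # w.l.o.g. the correct branch is 0 at every level
    samples = 0
    while True:
        samples += 1
        path = [rng.randrange(b) for _ in range(d)]
        if path == target:
            return samples

def curriculum_samples(b, d, rng):
    """Depth-increasing curriculum: at stage k the first k-1 levels are
    already learned, so only one new branch is searched per stage.
    Expected cost: d * b samples (polynomial in depth)."""
    samples = 0
    for _ in range(d):
        samples += 1
        while rng.randrange(b) != 0:  # retry until the correct branch is hit
            samples += 1
    return samples

rng = random.Random(0)
b, d, trials = 3, 8, 200
direct = sum(direct_samples(b, d, rng) for _ in range(trials)) / trials
curric = sum(curriculum_samples(b, d, rng) for _ in range(trials)) / trials
print(f"direct search:  ~{direct:.0f} samples (theory: b**d = {b ** d})")
print(f"curriculum:     ~{curric:.0f} samples (theory: d*b  = {d * b})")
```

Even at this small scale (`b = 3`, `d = 8`), the averages track the theoretical costs: thousands of samples for direct search versus a few dozen for the staged curriculum, mirroring the exponential-to-polynomial reduction the paper proves formally.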
Problem

Research questions and friction points this paper is trying to address.

Understanding why curriculum learning outperforms direct training in reasoning tasks
Modeling Chain-of-Thought generation as autoregressive reasoning trees for mathematical problems
Establishing theoretical guarantees for curriculum learning's polynomial complexity benefits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum post-training avoids exponential complexity bottleneck
Reinforcement-learning fine-tuning achieves polynomial sample complexity
Curriculum-aware querying reduces sampling cost to polynomial order