RL for Reasoning by Adaptively Revealing Rationales

📅 2025-06-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In complex sequence generation tasks, supervised fine-tuning (SFT) relies on dense expert annotations, while reinforcement learning (RL) suffers from sparse rewards and a combinatorially large action space. To address these dual challenges, this paper proposes AdaBack: an adaptive backtracking RL framework that leverages partial expert demonstrations. Its core innovation is a per-sample curriculum mechanism that dynamically adjusts the length of the supervised prefix based on the model's past reward signals, progressively revealing less of the correct reasoning path as the model improves. This approach sidesteps the generalization limitations of SFT and the exploration bottlenecks of RL. On a synthetic parity-constrained task and on standard mathematical reasoning benchmarks (MATH, GSM8k), AdaBack outperforms pure RL baselines and solves long-chain reasoning problems with latent dependencies that neither SFT nor RL alone can handle.

📝 Abstract
We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality; it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable. On mathematical reasoning benchmarks (MATH, GSM8k), we find that curriculum learning enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
Problem

Research questions and friction points this paper is trying to address.

RL struggles with sparse rewards in sequence generation
Supervised fine-tuning requires costly dense labels
Long sequences with latent dependencies challenge generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive backtracking for partial expert demonstrations
Per-sample curriculum learning with dynamic supervision
Incremental learning from correct partial solutions
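The per-sample mechanism above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class name `AdaBackCurriculum`, its fields, and the fixed step size are all assumptions.

```python
class AdaBackCurriculum:
    """Hypothetical sketch of AdaBack's per-sample curriculum.

    Each training sample tracks how many tokens of its expert rationale
    are revealed as a supervised prefix. After a successful completion
    (positive reward), the revealed prefix shrinks, forcing the model to
    generate more of the chain itself; after a failure, it grows back.
    """

    def __init__(self, rationale_lens, step=1):
        # rationale_lens: {sample_id: length of its expert rationale}.
        # Start by revealing each sample's full rationale.
        self.max_len = dict(rationale_lens)
        self.reveal = dict(rationale_lens)
        self.step = step

    def prefix_len(self, idx):
        """How many rationale tokens to condition on for this sample."""
        return self.reveal[idx]

    def update(self, idx, reward):
        """Adjust the supervised prefix from the sample's reward signal."""
        if reward > 0:
            # Success: backtrack the supervision boundary toward the start.
            self.reveal[idx] = max(0, self.reveal[idx] - self.step)
        else:
            # Failure: reveal a longer correct prefix again.
            self.reveal[idx] = min(self.max_len[idx],
                                   self.reveal[idx] + self.step)


# Toy usage: one sample with an 8-token rationale.
cur = AdaBackCurriculum({0: 8})
cur.update(0, reward=1.0)
print(cur.prefix_len(0))  # prints 7
```

When `prefix_len` reaches zero for every sample, training has smoothly interpolated from near-SFT (full rationale given) to pure RL (no rationale given), which is the intermediate regime the abstract describes.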