🤖 AI Summary
This work investigates how to improve the out-of-distribution (OOD) generalization of large language models trained with outcome-only supervision. The authors observe that increasing training-time reasoning length, such as the number of loops in a looped Transformer or the token budget during reinforcement learning (RL) fine-tuning, can continue to improve OOD performance even after in-distribution (ID) performance has saturated. They attribute this phenomenon to two mechanisms: a stronger inductive bias induced by self-iteration, and reduced reliance on spurious shortcut solutions as the number of iterations grows. The theory is supported empirically in two settings: looped Transformers on a synthetic task and RL fine-tuning of LLMs on mathematical reasoning, where larger reasoning budgets yield gains in cross-distribution generalization beyond what ID validation alone would suggest.
📝 Abstract
Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models that can solve complex problems previously out of reach. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. This makes reasoning length an important scaling knob. In this work, we identify a novel phenomenon (both theoretically and experimentally): under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length (e.g., the token budget in RL, or the loop count in looped Transformers) increases, even after in-distribution (ID) performance has saturated. This suggests that robustness may require a larger budget than ID validation alone would indicate. We provide theoretical explanations via two mechanisms: (i) self-iteration can induce a stronger inductive bias in the hypothesis class, reshaping ID-optimal solutions in ways that improve OOD generalization; and (ii) when shortcut solutions that work for ID samples but not for OOD samples persist in the hypothesis class, regularization can reduce the learned solution's reliance on these shortcuts as the number of self-iterations increases. We complement the theory with empirical evidence from two realizations of scaling training-time reasoning length: increasing the number of loops in looped Transformers on a synthetic task, and increasing token budgets during RL fine-tuning of LLMs on mathematical reasoning.
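The core knob the abstract describes, reusing the same computation for more iterations at a "reasoning length" chosen at training time, can be illustrated with a toy sketch. This is purely illustrative and not the paper's architecture: a Newton-style square-root update stands in for a single weight-tied reasoning step, small inputs play the role of ID data, and much larger inputs play the role of OOD data. The names `block`, `looped_forward`, and `n_loops` are invented for this sketch.

```python
def block(state, x):
    """One weight-tied refinement step (toy stand-in for a looped
    Transformer block): a Newton update converging to sqrt(x)."""
    return 0.5 * (state + x / state)

def looped_forward(x, n_loops):
    """Apply the SAME block n_loops times; n_loops is the
    training-time reasoning length (loop count)."""
    state = 1.0  # fixed initial latent state
    for _ in range(n_loops):
        state = block(state, x)  # identical parameters reused each loop
    return state

# "ID" input (small x): a short loop budget already suffices.
print(looped_forward(4.0, 5))        # close to sqrt(4) = 2

# "OOD" input (much larger x): the same budget is far from the answer,
# but a larger loop count recovers accuracy.
print(looped_forward(10000.0, 5))    # far from sqrt(10000) = 100
print(looped_forward(10000.0, 20))   # close to 100
```

The point of the sketch is the shape of the phenomenon, not the mechanism: once the ID inputs have converged, adding loops looks useless by ID validation, yet it is exactly what harder, out-of-range inputs need.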