Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

📅 2026-05-19
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the inefficiency of scaling pretraining in large language models, which stems from the absence of feedback signals from later-stage alignment processes. To bridge this gap, the authors propose Introspective Training (IXT), which brings post-training-style feedback into earlier stages by using a thinking reward model to generate natural language critiques. These critiques are incorporated into the training data via prefix conditioning, giving the model quality awareness from the earliest stages of the pipeline. Inspired by offline reward-conditioned reinforcement learning, IXT achieves up to 2.8× better compute efficiency on dense models ranging from 7.5B to 12B parameters trained on up to 18 trillion tokens, and reaches performance levels in math and code that models trained otherwise do not reach.
📝 Abstract
We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition stems from the fact that the dynamics of later stages of the pipeline, e.g., post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (or IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural-language, critique-based feedback, enabling quality-aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data with the generated feedback, ensuring that not all tokens are treated equally starting much earlier in training than usual. Comprehensive experiments on 7.5-12B transformer-based dense LLMs trained from scratch on up to 18 trillion tokens show that our method: bends scaling curves, resulting in up to 2.8x more compute efficiency generally; and reaches performance levels unachievable for models trained otherwise in domains such as math and code.
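To make the prefix-conditioning step in the abstract concrete, here is a minimal sketch of how a critique might be attached to a training document. Everything named here is an assumption for illustration: the `<feedback>` delimiters, the `Example` container, and `toy_critic`, which is only a placeholder for the thinking reward model and not the authors' annotator. The abstract does not specify the actual prompt format or how the prefix is treated in the loss.

```python
from dataclasses import dataclass

# Hypothetical delimiters; the source does not specify a prefix format.
FEEDBACK_OPEN = "<feedback>"
FEEDBACK_CLOSE = "</feedback>"


@dataclass
class Example:
    text: str          # full training string: feedback prefix followed by the document
    prefix_chars: int  # number of leading characters that belong to the prefix


def prefix_condition(document: str, critique: str) -> Example:
    """Prepend a natural-language critique to a document."""
    prefix = f"{FEEDBACK_OPEN}{critique.strip()}{FEEDBACK_CLOSE}\n"
    return Example(text=prefix + document, prefix_chars=len(prefix))


def toy_critic(document: str) -> str:
    """Placeholder for the reward model: a trivial word-count heuristic (assumption)."""
    n_words = len(document.split())
    if n_words < 5:
        return "Too short to judge; likely low information content."
    return "Reads as a complete passage."


doc = "The derivative of x**2 with respect to x is 2*x."
ex = prefix_condition(doc, toy_critic(doc))
print(ex.text)
```

Recording `prefix_chars` keeps the boundary between feedback and document recoverable, so a training loop could treat the two spans differently if desired; whether IXT does so is not stated in the abstract.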
Problem

Research questions and friction points this paper is trying to address.

scaling
LLM training
training efficiency
compute efficiency
training pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introspective Training
feedback conditioning
reward-conditioned reinforcement learning
quality-aware training
scaling efficiency