Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses inefficient scaling in LLM pretraining, which normally receives no feedback signal from later-stage alignment. The authors propose Introspective Training (IXT), which brings post-training-style feedback into earlier stages: a thinking (reasoning-based) reward model annotates training data with natural language critiques, and the model is trained on data prefix-conditioned with those critiques, giving it quality awareness from the earliest stages. Inspired by offline reward-conditioned reinforcement learning and applicable to any training stage, IXT yields up to 2.8× better compute efficiency in experiments on 7.5B to 12B parameter dense transformers trained from scratch on up to 18 trillion tokens, and reaches performance levels in domains such as math and code that the authors report as unachievable for models trained otherwise.

📝 Abstract

We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition is that the dynamics of later stages of the pipeline, e.g., post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (or IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural language critique-based feedback, enabling quality-aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data with the generated feedback, ensuring that not all tokens are treated equally starting much earlier in training than usual. Comprehensive experiments on 7.5-12B transformer-based dense LLMs trained from scratch on up to 18 trillion tokens show that our method bends scaling curves, resulting in up to 2.8x more compute efficiency generally, and reaches performance levels unachievable for models trained otherwise in domains such as math and code.
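The prefix-conditioning step described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the `<feedback>` delimiters, the `prefix_condition` helper, the `ConditionedSample` type, and the choice to mask the loss on the feedback prefix are all hypothetical, since the abstract does not specify the sample format or how the loss treats the critique tokens. Whitespace splitting stands in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical delimiters; the abstract does not specify the real format.
FEEDBACK_OPEN = "<feedback>"
FEEDBACK_CLOSE = "</feedback>"


@dataclass
class ConditionedSample:
    tokens: List[str]     # feedback prefix followed by the document tokens
    loss_mask: List[int]  # 1 = token contributes to the loss, 0 = ignored


def prefix_condition(
    document: str, critique: str, mask_prefix: bool = True
) -> ConditionedSample:
    """Prepend a natural-language critique to a document.

    Whitespace splitting is a stand-in for a real tokenizer. Masking the
    prefix (mask_prefix=True) is an assumption, not something the abstract
    states.
    """
    prefix = [FEEDBACK_OPEN, *critique.split(), FEEDBACK_CLOSE]
    body = document.split()
    prefix_weight = 0 if mask_prefix else 1
    return ConditionedSample(
        tokens=prefix + body,
        loss_mask=[prefix_weight] * len(prefix) + [1] * len(body),
    )


# The critique would come from the reward model; here it is hard-coded.
sample = prefix_condition(
    document="def add(a, b): return a + b",
    critique="Correct and concise, but lacks a docstring.",
)
```

In this sketch the model sees the critique before the document, so the document tokens are predicted conditioned on a quality description. Passing `mask_prefix=False` would instead train on the critique tokens as well; which variant matches the paper is not determined by the abstract.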

Problem

Research questions and friction points this paper is trying to address.

scaling

LLM training

training efficiency

compute efficiency

training pipeline

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introspective Training

feedback conditioning

reward-conditioned reinforcement learning