AI Summary
This work addresses the inefficiency of scaling pretraining in large language models, which stems from the absence of feedback signals from later-stage alignment processes. To bridge this gap, the authors propose Introspective Training (IXT), an approach that injects post-training-style feedback into pretraining by using a “reasoning-based reward model” to generate natural language critiques. These critiques are incorporated into the training data via prefix conditioning, giving the model quality awareness from the early stages of training. Inspired by offline reward-conditioned reinforcement learning, IXT achieves up to a 2.8× improvement in compute efficiency across dense models ranging from 7.5B to 12B parameters trained on up to 18 trillion tokens, and reaches performance levels on math and code tasks that conventionally trained models do not attain.
Abstract
We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition is that the dynamics of later stages of the pipeline, e.g. post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (IXT), which is inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural-language, critique-based feedback, enabling quality-aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data on the generated feedback, ensuring that not all tokens are treated equally from much earlier in training than usual. Comprehensive experiments on 7.5-12B dense transformer LLMs trained from scratch on up to 18 trillion tokens show that our method bends scaling curves, yielding up to 2.8x more compute efficiency in general, and reaches performance levels unachievable for models trained otherwise in domains such as math and code.
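As a rough illustration of the prefix-conditioning step described above, the following Python sketch prepends a natural-language critique to each training document. It is a minimal sketch under stated assumptions, not the authors' implementation: the delimiter strings, the `AnnotatedDocument` type, and the function names are hypothetical, and the critiques are assumed to have already been produced by a reward model.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical delimiters: the abstract does not specify the actual prefix format.
FEEDBACK_OPEN = "<feedback>"
FEEDBACK_CLOSE = "</feedback>"


@dataclass
class AnnotatedDocument:
    text: str      # raw training document
    critique: str  # natural-language critique (assumed to come from a reward model)


def prefix_condition(doc: AnnotatedDocument) -> str:
    """Prepend the critique so the document's tokens are modeled conditioned on it."""
    return f"{FEEDBACK_OPEN}{doc.critique.strip()}{FEEDBACK_CLOSE}\n{doc.text}"


def build_corpus(docs: List[AnnotatedDocument]) -> List[str]:
    """Apply prefix conditioning to every annotated document."""
    return [prefix_condition(d) for d in docs]


# Toy examples: one good and one buggy snippet, each with an illustrative critique.
docs = [
    AnnotatedDocument("def add(a, b):\n    return a + b", "Correct, clear, idiomatic code."),
    AnnotatedDocument("def add(a, b):\n    return a - b", "Buggy: subtracts instead of adding."),
]
corpus = build_corpus(docs)
```

Because the critique appears before the document, a standard next-token objective sees low- and high-quality text under different conditioning contexts. How IXT formats the prefix, and whether the prefix tokens themselves contribute to the loss, is not stated in the abstract.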