Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

📅 2026-05-26
🤖 AI Summary
This work addresses a limitation of existing post-training data engineering approaches, which rely predominantly on external metrics and neglect internal signals from large language models. The authors propose SAERL, a framework that uses internal representations extracted by sparse autoencoders (SAEs) to model data diversity, difficulty, and quality through SAE-space clustering, a difficulty proxy, and a quality probe. These signals guide batch mixing, curriculum sequencing, and data filtering in a lightweight manner. The SAE transfers across model families and scales. On Qwen2.5-Math-1.5B, SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps, with consistent gains across model sizes and reinforcement learning algorithms.
📝 Abstract
Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores the intrinsic signals in those internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties (diversity, difficulty, and quality) using model internals extracted with a sparse autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that the SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.
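The three operations named in the abstract (quality filtering, clustering with batch mixing, and easy-to-hard ordering) can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify how the difficulty proxy, the quality probe, or "moderate" mixing are computed. The sketch assumes that per-example SAE feature vectors, difficulty scores, and quality scores are already available as arrays, uses plain k-means as a stand-in clustering method, and approximates mixing by round-robin interleaving of clusters inside each curriculum stage. All function and parameter names (`plan_batches`, `interleave`, `stage_batches`, `min_quality`) are hypothetical.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


def interleave(labels):
    """Permutation that visits cluster labels in round-robin order."""
    rank = np.zeros(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rank[idx] = np.arange(len(idx))  # position within its own cluster
    # lexsort uses the LAST key as primary: sort by rank, then by label.
    return np.lexsort((labels, rank))


def plan_batches(feats, difficulty, quality, k=4, batch_size=8,
                 stage_batches=2, min_quality=0.5):
    """Return a list of index arrays, one per training batch.

    feats:      (n, d) assumed SAE feature activations per example
    difficulty: (n,)   assumed difficulty proxy, lower = easier
    quality:    (n,)   assumed quality-probe score, higher = better
    """
    # 1. Filtering: drop examples the quality score rejects.
    keep = np.flatnonzero(quality >= min_quality)
    if len(keep) == 0:
        return []
    # 2. Diversity: cluster the surviving examples in feature space.
    labels = kmeans(feats[keep], min(k, len(keep)))
    # 3. Curriculum: order surviving examples from easy to hard.
    order = np.argsort(difficulty[keep], kind="stable")
    keep, labels = keep[order], labels[order]
    # 4. Mixing: inside each stage, interleave clusters, then cut batches.
    stage = batch_size * stage_batches
    batches = []
    for s in range(0, len(keep), stage):
        idx = keep[s:s + stage][interleave(labels[s:s + stage])]
        batches += [idx[b:b + batch_size]
                    for b in range(0, len(idx), batch_size)]
    return batches
```

Interleaving only within a stage is a deliberate choice in this sketch: it keeps the coarse easy-to-hard ordering between stages while letting each batch draw from several clusters. A larger `stage_batches` trades curriculum strictness for more mixing.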
Problem

Research questions and friction points this paper is trying to address.

post-training data engineering
model internals
large language models
Sparse Autoencoders
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoder
Data Engineering
Model Internals
Reinforcement Learning
Curriculum Learning