Shaping capabilities with token-level data filtering

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a token-level data-filtering intervention that efficiently, robustly, and cheaply attenuates undesirable capabilities of language models in specific domains, such as healthcare, during pretraining. By using sparse autoencoders for fine-grained token annotation and distilling a lightweight classifier from those annotations, the method removes signals associated with the target domain at the token level rather than discarding entire documents. Experiments demonstrate that, on the largest evaluated model, the approach imposes up to a 7,000x compute slowdown on recovering target-domain capability, substantially outperforming document-level filtering. Crucially, it preserves beneficial general capabilities, remains compatible with downstream alignment processes, and exhibits robustness under noisy labeling conditions.

📝 Abstract
Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels with sufficient pretraining compute.
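The core intervention described above is simple: score each pretraining token with a cheap domain classifier and prevent flagged tokens from contributing to the training loss. The sketch below illustrates the loss-masking form of this idea; the function name `filter_token_loss`, the score representation, and the threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def filter_token_loss(per_token_loss, domain_scores, threshold=0.5):
    """Zero out loss contributions from tokens flagged as forget-domain.

    per_token_loss: cross-entropy loss per token.
    domain_scores:  classifier probabilities (hypothetical distilled
                    classifier) that each token belongs to the forget
                    domain, e.g. medical text.
    """
    losses = np.asarray(per_token_loss, dtype=float)
    scores = np.asarray(domain_scores, dtype=float)
    keep = scores < threshold          # keep only benign tokens
    kept = losses[keep]
    # Average over kept tokens only, so filtered tokens carry no
    # gradient signal during pretraining.
    return kept.mean() if kept.size else 0.0

# Example: the middle token looks domain-specific, so it is excluded.
loss = filter_token_loss([2.0, 5.0, 1.0], [0.1, 0.9, 0.2])
print(loss)  # → 1.5, the mean over the two kept tokens
```

This is why token filtering can be cheaper than document filtering: the benign tokens of a mixed document still contribute to training rather than being discarded wholesale.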
Problem

Research questions and friction points this paper is trying to address.

capability shaping
undesired capabilities
pretraining data filtering
token-level filtering
language model safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-level filtering
capability shaping
data attribution
sparse autoencoders
pretraining intervention