Model Spec Midtraining: Improving How Alignment Training Generalizes

📅 2026-05-03

🤖 AI Summary
This work addresses a limitation of standard alignment fine-tuning: it often yields shallow alignment that generalizes poorly, because demonstration data underspecifies the norms governing the desired behavior. The authors propose Model Spec Midtraining (MSM), which trains the model on synthetic documents discussing its Model Spec after pre-training and before alignment fine-tuning. This intermediate stage teaches the model the content of the spec, so that it generalizes the intended behavior from subsequent demonstrations. The study also finds that how the spec is written matters: explaining the values behind rules and giving specific rather than general guidance both improve generalization. On Qwen3-32B, MSM reduces the agentic misalignment rate from 54% to 7%, outperforming a deliberative alignment baseline (14%).
📝 Abstract
Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences, such as "I prefer cream cheese over brie", generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values. Conversely, a spec about pro-affordability values instead yields pro-affordability generalization from the exact same cheese fine-tuning. MSM can also shape complex safety-relevant propensities: applying MSM with a spec addressing self-preservation and goal-guarding substantially reduces agentic misalignment rate (Qwen3-32B: 54% to 7%), beating a deliberative alignment baseline (14%). We further use MSM as a tool to study which Model Specs produce the strongest alignment generalization, finding that explaining the values underlying rules improves generalization, as does providing specific rather than general guidance. Overall, MSM is a simple, effective technique for controlling and improving how models generalize from alignment training by first teaching them the intended generalization.
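To put the reported Qwen3-32B numbers on a common scale, the short sketch below converts them to relative reductions. It uses only the three percentages quoted in the abstract; it assumes the deliberative alignment baseline is measured against the same 54% starting rate, which the abstract implies but does not state outright. The function name `relative_reduction` is illustrative and not from the paper.

```python
# Agentic misalignment rates quoted in the abstract for Qwen3-32B (percent).
BASELINE_RATE = 54.0       # before any intervention
MSM_RATE = 7.0             # after model spec midtraining
DELIBERATIVE_RATE = 14.0   # deliberative alignment baseline (assumed same starting rate)


def relative_reduction(before: float, after: float) -> float:
    """Return the fraction of the starting rate that was eliminated."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


msm_reduction = relative_reduction(BASELINE_RATE, MSM_RATE)
da_reduction = relative_reduction(BASELINE_RATE, DELIBERATIVE_RATE)

print(f"MSM relative reduction: {msm_reduction:.1%}")
print(f"Deliberative alignment relative reduction: {da_reduction:.1%}")
```

Under that assumption, MSM removes roughly 87% of the baseline misalignment rate and the deliberative alignment baseline roughly 74%, and the residual rate under MSM is half that of the baseline method.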
Problem

Research questions and friction points this paper is trying to address.

alignment generalization
Model Spec
language model alignment
spec underspecification
behavioral generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model Spec Midtraining
alignment generalization
synthetic specification training
value grounding
agentic misalignment