SPINAL - Scaling-law and Preference Integration in Neural Alignment Layers

📅 2026-01-08

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of geometric characterization of Direct Preference Optimization (DPO), which limits audits, checkpoint comparisons, and failure prediction for aligned large language models. The authors propose SPINAL, a diagnostic showing that preference optimization is geometrically localized, occurring predominantly in the final decoder layers. Using a contraction score (quantifying spectral tail decay) and a transport score (a bounded overlap measure of how much the token distribution shifts between adjacent layers), SPINAL encodes each checkpoint as a quantifiable, auditable depth trace. Experiments show that aligned models exhibit stronger contraction and smoother transport in the late layers, whereas unaligned models trace higher-curvature, more entropic, and geometrically incoherent depth paths, supporting SPINAL's use for localizing where alignment concentrates and how strongly it manifests.

📝 Abstract

Direct Preference Optimization (DPO) is a principled, scalable alternative to RLHF for aligning large language models from pairwise preferences, but its internal geometric footprint remains undercharacterized, limiting audits, checkpoint comparisons, and failure prediction. We introduce SPINAL (Scaling-law and Preference Integration in Neural Alignment Layers), a diagnostic that measures how alignment reshapes representations across depth by tracing localized structural change layer by layer. Across model families, DPO produces a layerwise calibration effect concentrated in the final decoder blocks (often layers 21-30), where preference gradients most directly affect the next-token distribution. SPINAL encodes each checkpoint as a depth trace over (layer index, contraction score, transport score). The contraction score summarizes how quickly the tail of a layer's spectrum decays (how fast small modes vanish); higher values indicate stronger contraction into fewer effective directions. The transport score summarizes how much the token distribution shifts between adjacent layers using a bounded overlap measure; lower values indicate shorter, smoother steps through representation space. Aligned checkpoints show a late-layer ramp-up in contraction and a smooth reduction in transport, consistent with tightened and stabilized policy mass, while unaligned models trace higher-curvature, more entropic, and geometrically incoherent depth paths. Overall, alignment is geometrically localized: the final layers encode the dominant preference-induced corrections. SPINAL turns this localization into a practical audit signal, quantifying where alignment concentrates, how strongly it manifests, and when it begins to destabilize during training.
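
The two scores described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the contraction score is approximated by the negated log-log slope of the tail singular values of a layer's centered activation matrix, and the transport score by the Hellinger distance (one bounded overlap measure, derived from the Bhattacharyya coefficient) between the token distributions of adjacent layers. The function names, the `tail_fraction` parameter, and the way per-layer token distributions are obtained are hypothetical and not taken from the paper.

```python
import numpy as np


def contraction_score(hidden, tail_fraction=0.5):
    """Proxy for spectral tail decay of one layer (ASSUMED definition).

    hidden: (n_tokens, d_model) activation matrix for a single layer.
    Returns the negated least-squares slope of log(singular value) against
    log(rank) over the tail of the spectrum; larger means faster decay.
    """
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    s = s[s > 1e-12 * s[0]]  # drop numerically zero modes
    start = int(len(s) * (1.0 - tail_fraction))
    if len(s) - start < 2:
        raise ValueError("need at least two tail singular values to fit a slope")
    ranks = np.arange(start + 1, len(s) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(s[start:]), 1)
    return float(-slope)


def transport_score(p, q):
    """Hellinger distance between two token distributions (ASSUMED measure).

    Bounded in [0, 1]: 0 for identical distributions, 1 for disjoint support.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    overlap = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - overlap)))


def depth_trace(hiddens, token_dists):
    """Build a list of (layer index, contraction score, transport score).

    hiddens: list of (n_tokens, d_model) arrays, one per layer.
    token_dists: list of (vocab,) arrays, one per layer (hypothetical input,
    e.g. obtained by projecting each layer through the output head).
    The transport score at layer l compares layer l-1 with layer l.
    """
    trace = []
    for layer in range(1, len(hiddens)):
        trace.append((
            layer,
            contraction_score(hiddens[layer]),
            transport_score(token_dists[layer - 1], token_dists[layer]),
        ))
    return trace
```

Under these assumptions, the late-layer pattern the abstract reports for aligned checkpoints would appear as the second entry of each tuple rising and the third entry falling across the final layers.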

Problem

Research questions and friction points this paper is trying to address.

Direct Preference Optimization

neural alignment

geometric characterization

representation geometry

model auditing

Innovation

Methods, ideas, or system contributions that make the work stand out.

SPINAL

Direct Preference Optimization

representation geometry