🤖 AI Summary
This work addresses the opacity and limited controllability of current safety alignment methods, which implicitly encode safe behavior within model parameters. To overcome these limitations, the authors propose inserting a discrete information bottleneck between Transformer layers, featuring an explicit safety bit that enables interpretable and human-controllable safety decisions. By leveraging contrastive training and disentangled representations, the approach preserves the model's semantic generation capabilities while making safety judgments transparent and modifiable. The method requires only lightweight, modular fine-tuning and proves robust in red-teaming evaluations, achieving near-zero attack success rates and substantially outperforming both base models and conventional safety fine-tuning techniques.
📝 Abstract
Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between Transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations in which the safety bit governs the behavioral mode (producing helpful responses when $s=1$ and refusals when $s=0$), while additional unsupervised bits $u$ in the bottleneck encode semantic content, allowing information to flow through and preserving the model's generation capabilities. This design achieves both interpretability (the safety decision is directly readable) and controllability (the safety bit can be manually overridden), requiring only lightweight fine-tuning without pre-training from scratch. In red-team benchmarks, Safe Transformer achieves a near-zero Attack Success Rate, substantially outperforming base models and safety fine-tuning baselines.
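To make the mechanism concrete, here is a minimal NumPy sketch of a discrete bottleneck with a designated safety bit. This is an illustrative toy, not the authors' implementation: `W_enc`/`W_dec` are random stand-ins for learned projections, the sizes are assumptions, and the straight-through gradient trick and contrastive objective used in training are omitted. It shows the two properties the abstract claims: the safety bit $s$ is directly readable (interpretability) and can be manually overridden (controllability), while the remaining bits $u$ carry semantic content onward.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 16   # toy hidden size (real models use thousands of dimensions)
N_BITS = 8    # bottleneck width: 1 safety bit s + (N_BITS - 1) unsupervised bits u

# Hypothetical learned projections (random stand-ins for trained weights)
W_enc = rng.standard_normal((HIDDEN, N_BITS))
W_dec = rng.standard_normal((N_BITS, HIDDEN))

def bottleneck(h, safety_override=None):
    """Quantize hidden state h into discrete bits; bit 0 is the safety bit s.

    Returns (s, u, h_out): the readable safety bit, the unsupervised
    semantic bits, and the decoded hidden state passed to upper layers.
    """
    logits = h @ W_enc
    bits = (logits > 0).astype(np.float64)   # hard binarization at inference
    if safety_override is not None:          # controllability: manual override
        bits[0] = float(safety_override)
    s, u = bits[0], bits[1:]
    h_out = bits @ W_dec                     # upper layers condition on s and u
    return s, u, h_out

h = rng.standard_normal(HIDDEN)
s, u, h_out = bottleneck(h)                  # s is directly readable
s_forced, _, _ = bottleneck(h, safety_override=0)  # force refusal mode
```

In the full method, the upper layers are fine-tuned so that generation conditioned on $s=0$ yields a refusal and $s=1$ yields a helpful response; the override demonstrates how a human could flip that decision without touching model weights.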