🤖 AI Summary
This work addresses the safety degradation of language models during instruction tuning, a common issue where existing defenses either compromise utility or provide insufficient protection. The authors propose an adaptive regularization framework that dynamically adjusts alignment strength during fine-tuning based on real-time safety risk: high-risk parameter updates are constrained to remain close to a safe reference policy, while low-risk updates proceed normally. This approach is the first to enable training-time adaptive alignment guided by safety signals, leveraging either a judge-based Safety Critic score or a lightweight harmful-intent classifier trained on intermediate activations to assess risk, without incurring additional inference overhead. Experiments across diverse models and attack scenarios demonstrate that the method significantly reduces attack success rates while preserving downstream task performance, validating that pre-generation activation patterns effectively predict safety intent and that critic-based scoring offers high-recall guidance for alignment.
📝 Abstract
Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers the attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
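The risk-gated update described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact objective: the per-batch `risk_score` stands in for either estimator (the judge-based Safety Critic or the activation-based classifier), and the `threshold` and `beta` hyperparameters are assumptions for illustration. High-risk batches receive a KL penalty toward a frozen safe reference policy; low-risk batches train on the task loss alone.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(p || q) between the softmax distributions of two logit vectors."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def adaptive_regularized_loss(task_loss, policy_logits, ref_logits,
                              risk_score, threshold=0.5, beta=1.0):
    """Risk-gated training loss (illustrative sketch).

    risk_score in [0, 1] comes from a training-time risk estimator;
    threshold and beta are hypothetical hyperparameters, not values
    reported in the paper.
    """
    if risk_score > threshold:
        # High-risk batch: constrain the policy toward the safe reference.
        return task_loss + beta * kl_divergence(policy_logits, ref_logits)
    # Low-risk batch: proceed with standard fine-tuning.
    return task_loss
```

In a real training loop the KL term would be computed over the model's token-level output distributions and backpropagated; the gating logic, however, is exactly this simple: the regularizer is active only when the estimated risk crosses the threshold, which is why the method adds no inference-time cost.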