🤖 AI Summary
This work proposes an alignment-aware fine-tuning framework that addresses a common oversight in existing methods: by neglecting alignment objectives such as safety and hallucination mitigation, standard fine-tuning often exacerbates alignment deficiencies during task adaptation. The framework employs policy-gradient regularization guided by external alignment signals to dynamically balance task performance and alignment constraints at the sample level. It further incorporates an adaptive gating mechanism that modulates gradients and enables the model to learn to abstain from responding to high-risk inputs. This approach embeds conservative response behavior directly into the model without incurring additional inference overhead. Experimental results demonstrate that the framework significantly reduces harmful and hallucinatory outputs across general and domain-specific instruction-tuning benchmarks while preserving task performance, and that it exhibits strong robustness against adversarial fine-tuning and prompt-based attacks.
📝 Abstract
Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating conservative responses directly into the fine-tuned model. Experiments on general and domain-specific instruction-tuning benchmarks demonstrate consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Additional analyses show robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations, establishing adaptively gated alignment optimization as an effective approach for alignment-preserving and alignment-recovering model adaptation.
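The per-sample adaptive gating described in the abstract can be sketched in miniature as follows. This is an illustrative toy, not the paper's implementation: the sigmoid gate, the threshold `tau`, the sharpness `k`, and the use of a negated alignment reward as the regularization term are all assumptions introduced here for clarity.

```python
import math

def gated_loss(task_loss: float, alignment_reward: float,
               tau: float = 0.5, k: float = 10.0) -> float:
    """Blend a per-sample supervised loss with an alignment penalty.

    The gate approaches 1 when the external alignment signal is low
    (misaligned sample), shifting the update toward the alignment term;
    it approaches 0 when the sample is well aligned, recovering the
    standard supervised update. `tau` and `k` are hypothetical knobs.
    """
    # Sigmoid gate centered at tau: high reward -> gate near 0.
    gate = 1.0 / (1.0 + math.exp(k * (alignment_reward - tau)))
    # Policy-gradient-style regularizer: penalize low alignment reward.
    alignment_penalty = -alignment_reward
    return (1.0 - gate) * task_loss + gate * alignment_penalty

# A well-aligned sample (reward near 1) follows the task loss almost
# unchanged; a misaligned sample (reward near 0) is dominated by the
# alignment term instead.
well_aligned = gated_loss(task_loss=1.0, alignment_reward=1.0)
misaligned = gated_loss(task_loss=1.0, alignment_reward=0.0)
```

In this sketch the gate interpolates smoothly between the two objectives, so uncertain samples (reward near `tau`) receive a genuine mixture of both gradients, matching the abstract's description of prioritizing uncertain or misaligned cases.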