🤖 AI Summary
This work proposes an alignment-aware fine-tuning framework that addresses a common oversight in existing methods: by neglecting alignment objectives such as safety and hallucination mitigation, standard fine-tuning often exacerbates alignment deficiencies during task adaptation. The framework employs policy-gradient regularization guided by external alignment signals to dynamically balance task performance and alignment constraints at the sample level. It further incorporates an adaptive gating mechanism that modulates gradients and enables the model to learn to abstain from responding to high-risk inputs. This approach embeds conservative response behavior directly into the model without incurring additional inference overhead. Experimental results demonstrate that the framework significantly reduces harmful and hallucinatory outputs across general and domain-specific instruction-tuning benchmarks while preserving task performance, and that it exhibits strong robustness against adversarial fine-tuning and prompt-based attacks.
📝 Abstract
Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating conservative responses directly into the fine-tuned model. Experiments on general and domain-specific instruction-tuning benchmarks demonstrate consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Additional analyses show robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations, establishing adaptively gated alignment optimization as an effective approach for alignment-preserving and alignment-recovering model adaptation.
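The per-sample adaptive gating described in the abstract can be sketched in miniature as follows. This is an illustrative toy, not the paper's implementation: the sigmoid gate, the threshold `tau`, the sharpness `k`, and the use of a negated alignment reward as the regularization term are all assumptions introduced here for clarity.

```python
import math

def gated_loss(task_loss: float, alignment_reward: float,
               tau: float = 0.5, k: float = 10.0) -> float:
    """Blend a per-sample supervised loss with an alignment penalty.

    The gate approaches 1 when the external alignment signal is low
    (misaligned sample), shifting the update toward the alignment term;
    it approaches 0 when the sample is well aligned, recovering the
    standard supervised update. `tau` and `k` are hypothetical knobs.
    """
    # Sigmoid gate centered at tau: high reward -> gate near 0.
    gate = 1.0 / (1.0 + math.exp(k * (alignment_reward - tau)))
    # Policy-gradient-style regularizer: penalize low alignment reward.
    alignment_penalty = -alignment_reward
    return (1.0 - gate) * task_loss + gate * alignment_penalty

# A well-aligned sample (reward near 1) follows the task loss almost
# unchanged; a misaligned sample (reward near 0) is dominated by the
# alignment term instead.
well_aligned = gated_loss(task_loss=1.0, alignment_reward=1.0)
misaligned = gated_loss(task_loss=1.0, alignment_reward=0.0)
```

In this sketch the gate interpolates smoothly between the two objectives, so uncertain samples (reward near `tau`) receive a genuine mixture of both gradients, matching the abstract's description of prioritizing uncertain or misaligned cases.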