Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the degradation of general capabilities that safety alignment often inflicts on large language models, a consequence of conflicting objectives between safety and utility. The authors show, for the first time, that such conflicts are unevenly distributed across Transformer attention heads, and propose Conflict-Aware Sparse Tuning (CAST), a framework that constructs a head-level conflict map, combines it with functional sensitivity analysis to identify high-conflict heads, and selectively skips their updates during fine-tuning. Experimental results demonstrate that CAST preserves strong safety alignment while significantly mitigating the loss of general capabilities, achieving a superior safety-utility trade-off.
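The summary describes two diagnostic signals, per-head optimization conflict and functional sensitivity. As a rough illustration of how such a head-level conflict map could be computed, here is a minimal sketch: the gradient cosine-similarity conflict measure, the gradient-norm sensitivity proxy, the `head_params` layout, and the HuggingFace-style `.loss` output are all assumptions of this sketch, not the paper's published implementation.

```python
# Hypothetical sketch of a CAST-style head-level conflict map. The exact
# scoring formulas are not given in this listing; the two signals below
# (gradient cosine conflict and a utility-gradient-norm sensitivity proxy)
# are illustrative assumptions. `head_params` maps each head id to a list of
# (param, idx) pairs selecting that head's slice of the fused attention
# projection weights.
import torch
import torch.nn.functional as F

def per_head_grads(model, batch, head_params):
    model.zero_grad()
    model(**batch).loss.backward()  # assumes an HF-style model output
    # torch.cat copies the data, so these survive the next backward pass
    return {h: torch.cat([p.grad[idx].flatten() for p, idx in slices])
            for h, slices in head_params.items()}

def head_conflict_map(model, safety_batch, utility_batch, head_params):
    g_safe = per_head_grads(model, safety_batch, head_params)
    g_util = per_head_grads(model, utility_batch, head_params)
    scores = {}
    for h in head_params:
        # Optimization conflict: safety and utility gradients pointing
        # in opposing directions for this head.
        cos = F.cosine_similarity(g_safe[h], g_util[h], dim=0)
        conflict = torch.clamp(-cos, min=0.0)   # > 0 only when opposed
        # Functional sensitivity proxy: utility gradient magnitude.
        sensitivity = g_util[h].norm()
        scores[h] = (conflict * sensitivity).item()
    return scores  # higher = more conflicted and more utility-sensitive
```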

📝 Abstract
Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility-sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, which then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities mainly comes from updating a small group of "high-conflict" heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.
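To make the "skip high-conflict heads" idea concrete, the sketch below masks the gradient slices of the top-scoring heads before each optimizer step, which effectively freezes those heads during alignment fine-tuning. The top-k selection, the skip_ratio knob, and the training-loop shape are illustrative assumptions; the paper's actual update rule may differ.

```python
# Minimal sketch of the selective-update step, assuming conflict scores from
# a diagnosis pass like the one above. Zeroing a head's gradient slice before
# optimizer.step() skips that head during fine-tuning.
def make_head_mask(head_params, scores, skip_ratio=0.1):
    k = max(1, int(len(scores) * skip_ratio))
    # Skip the top-k highest-conflict heads (assumed selection rule).
    skip = {h for h, _ in sorted(scores.items(),
                                 key=lambda kv: kv[1], reverse=True)[:k]}

    def apply_mask():
        # Call after loss.backward() and before optimizer.step().
        for h in skip:
            for p, idx in head_params[h]:
                if p.grad is not None:
                    p.grad[idx].zero_()
    return apply_mask

# Hypothetical training-loop usage:
#   mask = make_head_mask(head_params, scores, skip_ratio=0.1)
#   for batch in alignment_data:
#       model(**batch).loss.backward()
#       mask()                          # skip high-conflict heads
#       optimizer.step(); optimizer.zero_grad()
```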
Problem

Research questions and friction points this paper is trying to address.

safety-utility conflict
large language models
alignment
modular heterogeneity
attention heads
Innovation

Methods, ideas, or system contributions that make the work stand out.

head-level diagnosis
safety-utility trade-off
sparse fine-tuning
modular heterogeneity
conflict-aware tuning
👥 Authors
Wang Cai (Baidu Inc.)
Yilin Wen (The University of Tokyo; Computer Vision, Robotics)
Jinchang Hou (Baidu Inc.)
Du Su (Assistant Researcher, CAS Key Laboratory of AI Safety; AI safety)
Guoqiu Wang (Baidu Inc.)
Zhonghou Lv (Baidu Inc.)
Chenfu Bao (Baidu Inc.)
Yunfang Wu (Peking University; NLP)