Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a weakness of aligned large language models: lightweight post-alignment perturbations, such as parameter noise, activation perturbations, or quantization, can compromise their safety behaviors. To make safety alignment more robust, the paper examines the role of the optimizer itself, which the authors describe as a first in this context, and proposes a two-stage training strategy. The approach begins with standard first-order safety alignment, followed by zeroth-order refinement applied selectively to robustness-critical layers identified through layer-wise sensitivity analysis. According to the authors, only a few zeroth-order update steps are needed to improve the stability of safety behavior under perturbations, while preserving the original alignment performance and keeping training overhead modest.

📝 Abstract

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.
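To make the idea of a zeroth-order update concrete, the following is a minimal sketch of a generic two-point (SPSA-style) gradient estimator, which builds its update only from loss evaluations at perturbed parameters. It is not the paper's implementation: the function names (`zo_gradient`, `zo_refine`, `toy_loss`), the step size, the perturbation scale, and the quadratic toy loss are all assumptions chosen for illustration, and the layer-wise selection step is omitted.

```python
import random


def zo_gradient(loss, theta, mu=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate of `loss` at `theta`.

    Hypothetical helper: only loss values at theta + mu*z and theta - mu*z
    are used, with no backpropagation.
    """
    z = [rng.gauss(0.0, 1.0) for _ in theta]  # random Gaussian direction
    plus = [t + mu * zi for t, zi in zip(theta, z)]
    minus = [t - mu * zi for t, zi in zip(theta, z)]
    # Central finite difference along z, projected back onto z.
    scale = (loss(plus) - loss(minus)) / (2.0 * mu)
    return [scale * zi for zi in z]


def zo_refine(loss, theta, steps=200, lr=0.05, mu=1e-3, seed=0):
    """Run a few zeroth-order descent steps starting from `theta`.

    In a two-stage setup, `theta` would come from an earlier first-order
    stage; here it is simply whatever the caller passes in (assumption).
    """
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = zo_gradient(loss, theta, mu, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def toy_loss(theta):
    """Stand-in objective (assumption): squared distance to the all-ones vector."""
    return sum((t - 1.0) ** 2 for t in theta)


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0, 0.0]
    refined = zo_refine(toy_loss, start)
    print("loss before:", toy_loss(start))
    print("loss after: ", toy_loss(refined))
```

Because each update is computed from the loss at perturbed parameters, the optimizer directly observes how the objective behaves under perturbation, which is the robustness-oriented signal the abstract refers to. A layer-selective variant would, under the same assumptions, perturb and update only the coordinates belonging to the chosen layers.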

Problem

Research questions and friction points this paper is trying to address.

safety alignment

robustness

large language models

fragile alignment

zeroth-order optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

zeroth-order optimization

safety alignment

robustness