🤖 AI Summary
This study addresses an underexplored problem: sequentially deploying multiple defense mechanisms in large language models can inadvertently degrade, or even undo, previously established safety capabilities. In the first systematic evaluation of ordering effects, covering 144 ordered defense sequences, we find that 38.9% of sequences exacerbate risk on the originally defended dimension, and we trace the root cause to strongly anti-aligned parameter updates in the critical layers shared by the two defenses. To mitigate this, we introduce a layer-wise conflict score based on the geometric tension between defense-induced activation subspaces, and a conflict-guided layer-freezing strategy that preserves prior protections without degrading the newly deployed defense. Mechanistic analyses, including layer-wise representational divergence, activation patching, and PCA trajectory tracking, localize the shared critical layers and confirm that defense conflicts are asymmetric and order-dependent.
📝 Abstract
Large Language Models (LLMs) deployed in high-stakes applications must simultaneously manage multiple risks, yet existing defenses are almost exclusively evaluated in isolation under a one-shot deployment assumption. In practice, providers patch models incrementally throughout their lifecycle, responding to newly exposed vulnerabilities or targeted data-removal requests without retraining from scratch. This raises a fundamental but underexplored question: does a later defense preserve the protections established by an earlier one? We present the first systematic study of cross-defense interactions under sequential deployment. Evaluating 144 ordered sequences across three risk dimensions and three model families, we find that 38.9% exhibit measurable risk exacerbation on the originally defended dimension. These interactions are highly asymmetric and order-dependent. To explain these phenomena, we conduct a mechanistic analysis on representative deployment sequences. Using layer-wise representational divergence and activation patching, we localize each defense to a compact set of critical layers. In conflicting sequences, the overlapping critical layers exhibit strongly anti-aligned parameter updates, whereas benign orderings maintain near-orthogonal updates. PCA trajectory analysis reveals that defense collapse stems from activation pattern reversals in these shared layers. We further introduce a layer-wise conflict score that quantifies the geometric tension between defense-induced activation subspaces, offering mechanistic insight into the observed reversals. Guided by this diagnosis, we propose conflict-guided layer freezing, a lightweight mitigation that selectively freezes high-conflict layers during sequential deployment, preserving prior protections without degrading secondary defense performance.
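To make the contrast between "anti-aligned" and "near-orthogonal" updates concrete, the sketch below compares two defenses' per-layer weight updates by cosine similarity and flags strongly anti-aligned layers as candidates for freezing. It is only an illustration under stated assumptions: the abstract's conflict score is defined on activation subspaces and is not specified here, so update cosine is used as a simple stand-in. The function names, the layer names, the toy update vectors, and the `-0.5` threshold are all hypothetical.

```python
import numpy as np


def update_cosines(delta_a, delta_b):
    """Cosine similarity between two defenses' weight updates, per layer.

    delta_a, delta_b: dicts mapping layer name -> update array (same shapes).
    Values near -1 mean anti-aligned updates, values near 0 mean
    near-orthogonal updates. This is a proxy, not the paper's conflict score.
    """
    scores = {}
    for name, da in delta_a.items():
        a = np.asarray(da, dtype=float).ravel()
        b = np.asarray(delta_b[name], dtype=float).ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        # Treat a layer that one defense left untouched as non-conflicting.
        scores[name] = float(a @ b / denom) if denom > 0 else 0.0
    return scores


def layers_to_freeze(scores, threshold=-0.5):
    """Layers whose updates are anti-aligned beyond a (hypothetical) threshold."""
    return sorted(name for name, c in scores.items() if c <= threshold)


# Toy updates: layer.0 is orthogonal across defenses, layer.1 is anti-aligned.
delta_a = {
    "layer.0": np.array([1.0, 0.0, 0.0, 0.0]),
    "layer.1": np.array([1.0, 2.0, 3.0, 4.0]),
}
delta_b = {
    "layer.0": np.array([0.0, 1.0, 0.0, 0.0]),
    "layer.1": -0.5 * np.array([1.0, 2.0, 3.0, 4.0]),
}

scores = update_cosines(delta_a, delta_b)
frozen = layers_to_freeze(scores)
print(scores)
print(frozen)
```

In a real training loop, the flagged layers would be excluded from gradient updates while the second defense is applied; how the authors select and freeze layers in practice is not described in the abstract.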