🤖 AI Summary
This work addresses a key limitation of existing activation-based jailbreaking methods, which model refusal as a single one-dimensional direction and thereby neglect the high-dimensional structure of activation distributions, limiting how effectively they can circumvent safety alignment. To overcome this, the study introduces optimal transport theory into activation-level jailbreaking for the first time, combining PCA for dimensionality reduction with closed-form Gaussian optimal transport to holistically align the activation distribution of harmful queries with that of benign ones. A layer-selective intervention strategy is further proposed: experiments show that refusal behavior is concentrated in just one or two middle layers (at roughly 40-60% of network depth), challenging the conventional practice of network-wide directional ablation. Evaluated across six mainstream large language models, the method achieves attack success rates up to 11% higher than state-of-the-art baselines while maintaining comparable perplexity, effectively weakening safety-aligned refusal without compromising linguistic capability.
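The PCA-plus-Gaussian-OT step can be sketched as below. This is a minimal illustration of the general technique, not the authors' implementation: the function name `gaussian_ot_map`, the shared pooled-PCA basis, and the subspace-only editing are all assumptions. The map applies the closed-form Monge map between two fitted Gaussians, `T(z) = μ_t + A(z − μ_s)` with `A = Σ_s^{-1/2}(Σ_s^{1/2} Σ_t Σ_s^{1/2})^{1/2} Σ_s^{-1/2}`.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(X_src, X_tgt, n_components=32):
    """Fit the closed-form optimal transport map between Gaussian fits of two
    activation sets, estimated in a shared PCA subspace (illustrative sketch)."""
    # Shared PCA basis from the pooled activations (an assumed design choice).
    X_all = np.vstack([X_src, X_tgt])
    mean_all = X_all.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_all - mean_all, full_matrices=False)
    P = Vt[:n_components].T                      # (d, k) orthonormal basis

    Z_src = (X_src - mean_all) @ P
    Z_tgt = (X_tgt - mean_all) @ P

    mu_s, mu_t = Z_src.mean(axis=0), Z_tgt.mean(axis=0)
    Sig_s = np.cov(Z_src, rowvar=False)
    Sig_t = np.cov(Z_tgt, rowvar=False)

    # Closed-form Monge map between Gaussians:
    # A = Sig_s^{-1/2} (Sig_s^{1/2} Sig_t Sig_s^{1/2})^{1/2} Sig_s^{-1/2}
    s_half = np.real(sqrtm(Sig_s))
    s_half_inv = np.linalg.inv(s_half)
    A = s_half_inv @ np.real(sqrtm(s_half @ Sig_t @ s_half)) @ s_half_inv

    def transport(x):
        z = (x - mean_all) @ P
        z_new = mu_t + (z - mu_s) @ A.T
        # Edit only the PCA subspace; the orthogonal complement is untouched.
        return x + (z_new - z) @ P.T

    return transport
```

With the full basis (`n_components` equal to the activation dimension), the transported sample exactly matches the target's empirical mean and covariance; with a truncated basis, only the leading subspace is aligned and the residual directions are preserved.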
📝 Abstract
Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
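The layer-selective finding suggests scoring each layer by how far apart the harmful and benign activation distributions sit before intervening. The abstract does not state the selection criterion, so the sketch below uses a hypothetical one: rank layers inside the 40-60% depth band by the 2-Wasserstein distance between Gaussian fits, and keep the top one or two. All names and the criterion itself are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(X, Y):
    """Squared 2-Wasserstein distance between Gaussian fits of two samples:
    ||mu_x - mu_y||^2 + tr(Sx + Sy - 2 (Sx^{1/2} Sy Sx^{1/2})^{1/2})."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Sx, Sy = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
    sx_half = np.real(sqrtm(Sx))
    cross = np.real(sqrtm(sx_half @ Sy @ sx_half))
    return float(np.sum((mu_x - mu_y) ** 2) + np.trace(Sx + Sy - 2 * cross))

def select_layers(harmful_acts, benign_acts, depth_range=(0.4, 0.6), top_k=2):
    """Pick the top_k layers in depth_range whose harmful vs. benign
    activation distributions differ most (hypothetical criterion)."""
    n_layers = len(harmful_acts)
    lo = int(depth_range[0] * n_layers)
    hi = int(depth_range[1] * n_layers)
    scores = {i: gaussian_w2(harmful_acts[i], benign_acts[i])
              for i in range(lo, hi + 1)}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Restricting the search to a middle-depth band mirrors the paper's observation that refusal appears localized around 40-60% of network depth, so intervening at one or two such layers can replace whole-network ablation.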