🤖 AI Summary
This work uncovers a critical geometric blind spot in large language model (LLM) alignment: adversarial prompts exploit “latent camouflage,” embedding harmful intent in latent-space regions proximal to, yet distinct from, the safe representation manifold, thereby evading mainstream defenses such as DPO. To address this, we introduce ALKALI, the first geometry-aware adversarial benchmark, comprising 9,000 samples across 15 attack categories. We further propose GRACE, a geometrically informed alignment framework integrating layer-aware representation disentanglement, adversarial behavior condensation, and latent-space regularization. Additionally, we define AVQI, a geometrically grounded evaluation metric quantifying alignment vulnerability. Extensive evaluation across 21 state-of-the-art LLMs demonstrates that GRACE reduces attack success rates by up to 39%, while AVQI precisely pinpoints alignment failures. The code and benchmark are publicly released.
📝 Abstract
Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent, thereby evading surface-level defenses such as Direct Preference Optimization (DPO), which remain blind to this latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date, spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open- and closed-source models, exposing latent camouflage as a structural blind spot in which adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE (Geometric Representation-Aware Contrastive Enhancement), an alignment framework coupling preference learning with latent-space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These constraints operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model and reducing ASR by up to 39%. Moreover, we introduce AVQI, a geometry-aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md.
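The abstract's core geometric ideas can be made concrete with a small sketch. Since the exact AVQI formula and GRACE training objective are not given here, the snippet below is illustrative only, under simple assumptions: AVQI is approximated as a ratio of inter-cluster centroid separation to intra-cluster compactness, and the two GRACE constraints are modeled as a margin-based centroid-separation penalty plus a cohesion penalty on adversarial embeddings. All function names (`avqi_like_score`, `separation_loss`, `cohesion_loss`) are hypothetical, not the paper's API.

```python
import numpy as np

def compactness(X):
    # Intra-cluster spread: mean distance of points to their centroid.
    centroid = X.mean(axis=0)
    return np.linalg.norm(X - centroid, axis=1).mean()

def centroid_separation(X, Y):
    # Inter-cluster separation: distance between the two centroids.
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

def avqi_like_score(safe, unsafe, eps=1e-8):
    # Illustrative stand-in for AVQI: higher when safe and unsafe
    # clusters are both tight and far apart; near zero under
    # "latent camouflage", where unsafe embeddings overlap safe ones.
    sep = centroid_separation(safe, unsafe)
    comp = compactness(safe) + compactness(unsafe)
    return sep / (comp + eps)

def separation_loss(safe, adv, margin=1.0):
    # GRACE-style constraint 1 (assumed form): hinge penalty when the
    # safe and adversarial centroids are closer than a margin.
    return max(0.0, margin - centroid_separation(safe, adv))

def cohesion_loss(adv):
    # GRACE-style constraint 2 (assumed form): penalize spread among
    # adversarial embeddings so unsafe behaviors condense together.
    return compactness(adv)

# Synthetic pooled embeddings: one safe cluster and two unsafe ones,
# one camouflaged (overlapping safe) and one well separated.
rng = np.random.default_rng(0)
offset_far = np.zeros(8); offset_far[0] = 3.0
offset_near = np.zeros(8); offset_near[0] = 0.2
safe = 0.1 * rng.normal(size=(50, 8))
unsafe_far = 0.1 * rng.normal(size=(50, 8)) + offset_far
unsafe_camouflaged = 0.1 * rng.normal(size=(50, 8)) + offset_near

score_far = avqi_like_score(safe, unsafe_far)
score_near = avqi_like_score(safe, unsafe_camouflaged)
loss_far = separation_loss(safe, unsafe_far)
loss_near = separation_loss(safe, unsafe_camouflaged)
```

In this toy setup, the camouflaged cluster yields a much lower separation-to-compactness score and a nonzero separation penalty, matching the abstract's intuition that adversarial completions hiding near the safe manifold are exactly the failures a geometry-aware metric and regularizer should surface.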