Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreaking attacks, and existing safety mechanisms are frequently bypassed by adversarial prompts. To address this, we propose a fine-tuning-free, inference-time safety enhancement method. Our approach identifies harmful directions in the final-layer hidden space via concept alignment and precisely suppresses their representations using orthogonal projection. Inspired by contrastive perturbation techniques from computer vision, we adapt this principle to LLM safety and integrate it with the Carlini-Wagner (CW) optimization framework for efficient perturbation generation. Experiments demonstrate that our method significantly reduces harmful output rates—by an average of 42.3%—outperforming mainstream baselines across multiple benchmarks. It incurs only ~8% additional inference overhead, offering a lightweight, highly compatible, and strongly generalizable solution without requiring model modification or retraining.
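The latent edit described in the summary amounts to an orthogonal projection of the final-layer hidden states away from a harmful direction. The sketch below is a minimal, hypothetical PyTorch illustration of that operation; `harmful_dir` is assumed to be a precomputed direction vector, and the function name and `strength` parameter are illustrative, not part of the paper's released code.

```python
import torch

def suppress_direction(hidden_states: torch.Tensor,
                       harmful_dir: torch.Tensor,
                       strength: float = 1.0) -> torch.Tensor:
    """Project final-layer hidden states away from a harmful direction.

    hidden_states: (batch, seq_len, d_model) activations from the last layer.
    harmful_dir:   (d_model,) direction associated with harmful content
                   (assumed precomputed, e.g. via concept alignment).
    strength:      1.0 removes the component entirely (orthogonal projection);
                   values in (0, 1) only attenuate it.
    """
    u = harmful_dir / harmful_dir.norm()      # unit vector along harmful direction
    coeff = hidden_states @ u                 # (batch, seq_len) projection coefficients
    return hidden_states - strength * coeff.unsqueeze(-1) * u
```

In practice such an edit would likely be applied through a forward hook on the model's final transformer block, so that generation proceeds from the projected states without any retraining.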

📝 Abstract
Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Latent Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying latent representations in the last layer of the model, without retraining. Leveraging the Carlini-Wagner (CW) technique from computer vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods on most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.
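The abstract's reference to the Carlini-Wagner (CW) technique suggests a perturbation found by minimizing a norm penalty plus a margin term, adapted here from adversarial examples in vision to hidden-state editing. The following sketch shows one way such a CW-style objective might look; the loss form, hyperparameters, and variable names are assumptions for illustration, not the paper's implementation.

```python
import torch

def cw_style_perturbation(h: torch.Tensor, harmful_dir: torch.Tensor,
                          c: float = 1.0, kappa: float = 0.0,
                          steps: int = 50, lr: float = 1e-2) -> torch.Tensor:
    """Find a small additive perturbation delta on hidden states h so that their
    alignment with the harmful direction falls below a margin, CW-style:
        minimise ||delta||^2 + c * max(alignment(h + delta) + kappa, 0).
    """
    u = harmful_dir / harmful_dir.norm()
    delta = torch.zeros_like(h, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        alignment = (h + delta) @ u                                  # projection onto harmful direction
        loss = delta.pow(2).sum() + c * torch.clamp(alignment + kappa, min=0).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (h + delta).detach()
```

The trade-off parameter `c` balances keeping the perturbation small against pushing the harmful-direction component below the margin `kappa`, mirroring the structure of the original CW attack objective.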
Problem

Research questions and friction points this paper is trying to address.

Preventing jailbreak attacks that bypass LLM safety guardrails
Suppressing harmful concepts through latent representation manipulation
Reducing harmful outputs without retraining or fine-tuning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modifies latent representations without retraining
Removes harmful content directions via orthogonal projection (see the direction-estimation sketch after this list)
Adapts the Carlini-Wagner (CW) optimization technique from computer vision to LLM safety
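As referenced in the list above, a harmful direction could be estimated by concept alignment over contrastive prompt sets. A minimal sketch follows, assuming a difference-of-means estimate over final-layer activations using Hugging Face `transformers`; the prompt sets and the estimator itself are illustrative assumptions, not the paper's exact concept-alignment procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def estimate_harmful_direction(model_name: str,
                               harmful_prompts: list[str],
                               benign_prompts: list[str]) -> torch.Tensor:
    """Estimate a 'harmful' unit direction in the final-layer hidden space as the
    difference between mean last-token activations on harmful vs. benign prompts."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    def mean_last_hidden(prompts):
        states = []
        for p in prompts:
            inputs = tok(p, return_tensors="pt")
            with torch.no_grad():
                out = model(**inputs)
            # final-layer hidden state of the last token
            states.append(out.hidden_states[-1][0, -1])
        return torch.stack(states).mean(dim=0)

    direction = mean_last_hidden(harmful_prompts) - mean_last_hidden(benign_prompts)
    return direction / direction.norm()
```

The resulting unit vector could then be passed to the projection step sketched earlier to suppress harmful content at inference time.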