Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

📅 2026-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a whitelist-based test-time defense that addresses the limitations of existing blacklist approaches, which rely on a finite, continually evolving set of harmful examples and consequently suffer from poor generalization and high false-rejection rates. Instead of using harmful samples, the method models the geometric structure of the benign latent space from abundant harmless data and employs an anisotropic ellipsoidal constraint to guide model updates. The approach triggers refusal responses to adversarial inputs while minimally perturbing normal functionality. By introducing latent-variable ellipsoid geometry into defense optimization, the method jointly enhances security and utility. Empirical evaluations show that it significantly improves robustness across multiple large language models and jailbreak attack scenarios, while better preserving performance on benign tasks.
📝 Abstract
Representation engineering (RepE) defenses have shown strong robustness against jailbreak attacks on large language models (LLMs). However, these methods fundamentally rely on black-list supervision: they learn jailbreak-to-refusal activation transformations from harmful or jailbreak data that are inherently incomplete and continuously evolving. Hence, the performance of RepE-based defenses becomes tightly coupled to the quality and coverage of the collected harmful samples, leaving models vulnerable to unseen attacks. This reliance also obscures the distinction between defenses that fit known harmful distributions and defenses that protect a benign latent region without estimating the harmful distribution. We adopt the opposite, white-list perspective by leveraging the accessibility and abundance of benign data. The goal is to elicit refusal on arbitrary inputs while ensuring that harmless inputs are not falsely rejected. This shifts the core research question to: How can we design a robust benign-latent preservation mechanism such that the benign latent distribution remains intact while refusal is elicited? To answer this, we propose Ellipsoid Control, a test-time defense. It performs projected gradient descent that can elicit refusal on arbitrary inputs, with the aim of improving defense effectiveness. At the same time, an anisotropic benign-geometry ellipsoid, fitted from abundant benign data, constrains the update so that distortion of the benign latent geometry is minimized. This tight constraint helps preserve model utility. Across multiple LLMs, jailbreak attacks, benign tasks, and safety-boundary evaluations, Ellipsoid Control consistently enhances safety while better preserving utility, demonstrating the effectiveness of the white-list approach for jailbreak defense.
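To make the idea of an ellipsoid-constrained projected gradient step more concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract does not give the exact formulation, so the sketch assumes the ellipsoid is a Mahalanobis ball fitted from the mean and covariance of benign latent vectors, that its radius is a quantile of benign distances, and that the constraint is applied directly to a latent vector. The names `fit_ellipsoid`, `project_onto_ellipsoid`, `constrained_step`, and the parameters `coverage`, `ridge`, and `lr` are all hypothetical.

```python
import numpy as np


def fit_ellipsoid(benign_latents, coverage=0.95, ridge=1e-6):
    """Fit an anisotropic ellipsoid (mean, precision, squared radius).

    Assumption: the ellipsoid is {h : (h - mu)^T P (h - mu) <= r2}, where P is
    the inverse of the benign covariance and r2 is chosen so that a `coverage`
    fraction of the benign latents fall inside.
    """
    mu = benign_latents.mean(axis=0)
    centered = benign_latents - mu
    cov = centered.T @ centered / len(benign_latents)
    cov += ridge * np.eye(cov.shape[0])  # keep the covariance invertible
    prec = np.linalg.inv(cov)
    # Squared Mahalanobis distance of every benign sample to the mean.
    d2 = np.einsum("ni,ij,nj->n", centered, prec, centered)
    r2 = np.quantile(d2, coverage)
    return mu, prec, r2


def project_onto_ellipsoid(h, mu, prec, r2):
    """Pull h back inside the ellipsoid by shrinking it radially toward mu.

    Radial shrinking is the exact projection in the Mahalanobis metric
    induced by `prec`; it is not the Euclidean projection.
    """
    diff = h - mu
    d2 = diff @ prec @ diff
    if d2 <= r2:
        return h
    return mu + diff * np.sqrt(r2 / d2)


def constrained_step(h, grad, mu, prec, r2, lr=0.1):
    """One projected gradient descent step on a (hypothetical) refusal loss."""
    return project_onto_ellipsoid(h - lr * grad, mu, prec, r2)
```

Because the fitted covariance is anisotropic, the same step size moves a latent much further along high-variance benign directions than along low-variance ones before the projection clips it, which is one plausible reading of how a tight ellipsoid could limit distortion of benign geometry. In a real system `grad` would come from a refusal objective on the model; here it is left as an input.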
Problem

Research questions and friction points this paper is trying to address.

jailbreak defense
white-list approach
benign latent modeling
representation engineering
LLM safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

white-list defense
benign latent modeling
Ellipsoid Control
jailbreak robustness
representation engineering