🤖 AI Summary
Large language models (LLMs) exhibit significant vulnerabilities across multiple safety dimensions—including jailbreaking, toxicity, hallucination, and bias—while existing defenses suffer from poor generalizability and rigid policies (e.g., excessive refusal). Method: We propose the Adversarial Scenario Extrapolation (ASE) framework, a novel inference-time, self-generative defense that internalizes adversarial awareness as an intrinsic cognitive process. ASE employs chain-of-thought (CoT) reasoning to autonomously construct adversarial scenarios and dynamically derive adaptive defense strategies, augmented by a lightweight inference-time intervention mechanism. Contribution/Results: ASE uniformly mitigates diverse novel attacks without compromising response naturalness. On four adversarial benchmarks, it achieves near-zero jailbreak rates, >90% toxicity reduction, <4% refusal rate, 92–99% QA accuracy, and 4–10× reduction in bias scores—consistently outperforming six state-of-the-art methods.
📝 Abstract
Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four of the latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92–99% accuracy on adversarial Q&A and 4–10× lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.