🤖 AI Summary
Large language models (LLMs) exhibit significant vulnerabilities across multiple safety dimensions—including jailbreaking, toxicity, hallucination, and bias—while existing defenses suffer from poor generalizability and rigid policies (e.g., excessive refusal). Method: We propose the Adversarial Scenario Extrapolation (ASE) framework, a novel inference-time, self-generative defense that internalizes adversarial awareness as an intrinsic cognitive process. ASE employs chain-of-thought (CoT) reasoning to autonomously construct adversarial scenarios and dynamically derive adaptive defense strategies, augmented by a lightweight inference-time intervention mechanism. Contribution/Results: ASE uniformly mitigates diverse novel attacks without compromising response naturalness. On four adversarial benchmarks, it achieves near-zero jailbreak rates, >90% toxicity reduction, <4% refusal rate, 92–99% QA accuracy, and 4–10× reduction in bias scores—consistently outperforming six state-of-the-art methods.
📝 Abstract
Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four of the latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92–99% accuracy on adversarial Q&A and 4–10× lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.