Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit significant vulnerabilities across multiple safety dimensions—including jailbreaking, toxicity, hallucination, and bias—while existing defenses suffer from poor generalizability and rigid policies (e.g., excessive refusal). Method: We propose the Adversarial Scenario Extrapolation (ASE) framework, a novel inference-time, self-generative defense that internalizes adversarial awareness as an intrinsic cognitive process. ASE employs chain-of-thought (CoT) reasoning to autonomously construct adversarial scenarios and dynamically derive adaptive defense strategies, augmented by a lightweight inference-time intervention mechanism. Contribution/Results: ASE uniformly mitigates diverse novel attacks without compromising response naturalness. On four adversarial benchmarks, it achieves near-zero jailbreak rates, >90% toxicity reduction, <4% refusal rate, 92–99% QA accuracy, and 4–10× reduction in bias scores—consistently outperforming six state-of-the-art methods.
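As a rough illustration of the pipeline the summary describes, the sketch below wires up a three-stage ASE-style loop: extrapolate adversarial scenarios, derive defense strategies, then answer. The stage prompts and the `generate` callable are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# Minimal sketch of an ASE-style inference-time defense loop.
# `generate` stands in for any LLM completion function; the prompt
# wording here is an assumption for illustration only.
from typing import Callable

def ase_respond(query: str, generate: Callable[[str], str]) -> str:
    """Answer `query` after self-generated adversarial contemplation."""
    # Stage 1: have the model extrapolate adversarial scenarios the
    # query could be part of (jailbreak, toxicity, misinformation, bias).
    scenarios = generate(
        "Before answering, think step by step about how the following "
        "query could be part of an adversarial scenario (jailbreak, "
        "toxic elicitation, misinformation, or bias probing):\n"
        f"Query: {query}\nList the plausible adversarial scenarios."
    )

    # Stage 2: derive defense strategies tailored to those scenarios.
    strategies = generate(
        "Given these potential adversarial scenarios:\n"
        f"{scenarios}\n"
        "Formulate defensive strategies that neutralize them while "
        "still answering legitimate intents helpfully."
    )

    # Stage 3: answer conditioned on the derived strategies, rather
    # than issuing a blanket refusal.
    return generate(
        f"Defensive strategies:\n{strategies}\n\n"
        f"Now answer the user query helpfully and safely:\n{query}"
    )
```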

📝 Abstract
Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four of the latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM robustness against diverse adversarial attacks
Reducing toxicity and bias in language model responses
Improving user experience by minimizing outright rejections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-generative adversarial scenario contemplation
Chain-of-Thought reasoning for defense
Inference-time robustness-seamlessness optimization (see the intervention sketch below)
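The summary above also mentions a lightweight inference-time intervention. The sketch below is one plausible reading under stated assumptions: the `is_unsafe` classifier hook and the single-revision retry are illustrative stand-ins, not the paper's exact mechanism. Asking for a revision rather than refusing outright matches the low-refusal behavior the paper reports.

```python
# Sketch of a lightweight inference-time intervention: screen the
# drafted response and revise once if it trips a safety check.
from typing import Callable

def guarded_respond(query: str,
                    respond: Callable[[str], str],
                    generate: Callable[[str], str],
                    is_unsafe: Callable[[str], bool]) -> str:
    """Draft an answer, then intervene only if it appears unsafe."""
    draft = respond(query)  # e.g., the ase_respond sketch above
    if is_unsafe(draft):
        # Intervene by requesting a revision instead of refusing
        # outright, preserving response naturalness.
        draft = generate(
            "Revise the following answer to remove any unsafe content "
            f"while staying helpful:\n{draft}"
        )
    return draft
```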
Md. Rafi Ur Rashid
Pennsylvania State University
Vishnu Asutosh Dasu
Pennsylvania State University
Trustworthy Machine Learning · Applied Cryptography · Security and Privacy
Ye Wang
Mitsubishi Electric Research Laboratories
Gang Tan
Pennsylvania State University
Shagufta Mehnaz
Pennsylvania State University
Information Security & Privacy