Adversarial Reasoning at Jailbreaking Time

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the alignment-failure problem in large language models (LLMs), systematically exposing their safety vulnerabilities. The authors propose a test-time adversarial reasoning framework that introduces dynamic compute allocation into jailbreaking attacks: high-success-rate jailbreak prompts are generated *during inference* through iterative, feedback-guided refinement, rather than through static prompt engineering or unguided black-box search. This establishes a new paradigm for diagnosing robustness deficiencies in LLMs. Experiments demonstrate state-of-the-art attack success rates (ASRs) against multiple strongly aligned LLMs, including models explicitly designed to trade inference-time compute for adversarial robustness. The approach thus provides a scalable, interpretable pathway for AI safety evaluation and alignment hardening.
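
To make the idea concrete, below is a minimal Python sketch of a score-guided, test-time refinement loop in the spirit of the approach described above. It is an illustration under stated assumptions, not the paper's algorithm: `attacker_propose`, `target_respond`, and `judge_score` are hypothetical stand-ins for calls to an attacker LLM, the target LLM, and a judge model.

```python
# A minimal sketch of a test-time adversarial refinement loop, NOT the
# paper's exact method. All callables below are hypothetical placeholders
# for attacker / target / judge model calls.
from typing import Callable, List, Tuple

def adversarial_search(
    goal: str,
    attacker_propose: Callable[[str, List[Tuple[str, float]]], List[str]],
    target_respond: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    steps: int = 10,
    branch: int = 4,
) -> Tuple[str, float]:
    """Iteratively refine jailbreak candidates using score feedback.

    attacker_propose(goal, history) -> candidate prompts
    target_respond(prompt)          -> target model's reply
    judge_score(goal, reply)        -> scalar attack-success score
    """
    history: List[Tuple[str, float]] = []  # (prompt, score) pairs seen so far
    best_prompt, best_score = goal, float("-inf")

    for _ in range(steps):
        # Spend test-time compute: branch into several candidate prompts,
        # conditioned on the goal and on previously scored attempts.
        candidates = attacker_propose(goal, history)[:branch]
        for prompt in candidates:
            reply = target_respond(prompt)
            score = judge_score(goal, reply)
            history.append((prompt, score))
            if score > best_score:
                best_prompt, best_score = prompt, score
        # Keep only the strongest attempts to steer the next round.
        history = sorted(history, key=lambda x: x[1], reverse=True)[:branch]

    return best_prompt, best_score
```

The design point this sketch highlights is that compute is spent at inference time: each round conditions the attacker on its scored previous attempts, so additional search steps translate into stronger prompts rather than requiring any retraining.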

📝 Abstract
As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation that achieves SOTA attack success rates (ASR) against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Intentional Error Generation
AI Safety and Reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Error Induction
Large Language Models
Computational Stress Testing