🤖 AI Summary
This study addresses the alignment failure problem in large language models (LLMs), systematically exposing their safety vulnerabilities. We propose a test-time adversarial reasoning framework that introduces dynamic allocation of test-time compute into jailbreaking attacks: it generates high-success-rate jailbreaking prompts *during inference* via multi-step policy optimization and standardized prompt synthesis, eliminating the reliance on static prompt engineering or black-box search. This establishes a new paradigm for diagnosing robustness deficiencies in LLMs. Experiments demonstrate state-of-the-art attack success rates (ASRs) across multiple strongly aligned LLMs, including models explicitly designed to trade inference-time compute for adversarial robustness. Our approach thus provides a scalable, interpretable technical pathway for AI safety evaluation and alignment hardening.
📝 Abstract
As large language models (LLMs) become more capable and widespread, the study of their failure cases grows increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation that achieves state-of-the-art (SOTA) attack success rates (ASRs) against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm for understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.