🤖 AI Summary
Existing adversarial attack methods for large language models (LLMs) show limited effectiveness on hard jailbreaking and prompt injection tasks. This work proposes a Claude-driven LLM agent framework that autonomously investigates and iteratively refines white-box adversarial strategies to discover novel attack algorithms. For the first time, an LLM agent independently designs an attack method that significantly outperforms more than 30 baseline approaches and generalizes across models. The discovered method achieves a 40% success rate on CBRN queries against GPT-OSS-Safeguard-20B, versus ≤10% for all baselines, and a 100% attack success rate against the held-out Meta-SecAlign-70B model, compared to 56% for the best-performing baseline.
📝 Abstract
LLM agents like Claude Code can not only write code but also conduct autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an \emph{autoresearch}-style pipeline \citep{karpathy2026autoresearch} powered by Claude Code discovers novel white-box adversarial attack \textit{algorithms} that \textbf{significantly outperform all existing (30+) methods} in jailbreaking and prompt injection evaluations.
Starting from existing attack implementations, such as GCG~\citep{zou2023universal}, the agent iterates to produce new algorithms achieving up to 40\% attack success rate (ASR) on CBRN queries against GPT-OSS-Safeguard-20B, compared to $\leq$10\% for existing algorithms (\Cref{fig:teaser}, left).
The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving \textbf{100\% ASR against Meta-SecAlign-70B} \citep{chen2025secalign} versus 56\% for the best baseline (\Cref{fig:teaser}, middle). Extending the findings of~\cite{carlini2025autoadvexbench}, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at \url{https://github.com/romovpa/claudini}.
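To illustrate why white-box red-teaming gives the agent such dense feedback, the sketch below shows a GCG-style greedy coordinate search over discrete tokens. The "model" here is a toy scoring function standing in for an LLM's loss on a target completion; all names (`toy_loss`, `greedy_coordinate_search`, the vocabulary and target) are hypothetical illustrations, not the paper's discovered algorithm.

```python
import random

# Toy white-box setting: the attacker can query a known, cheap loss on any
# candidate token sequence (standing in for an LLM's loss on a target output).
VOCAB = list(range(50))
TARGET = [7, 3, 42, 19, 0, 11]

def toy_loss(tokens):
    # Dense, quantitative feedback: lower is better, 0 = "attack succeeded".
    return sum(abs(t - g) for t, g in zip(tokens, TARGET))

def greedy_coordinate_search(tokens, n_steps=200, n_candidates=8, seed=0):
    """GCG-style loop: repeatedly pick a position, try a few candidate
    token substitutions, and keep any substitution that lowers the loss."""
    rng = random.Random(seed)
    tokens = list(tokens)
    best = toy_loss(tokens)
    for _ in range(n_steps):
        pos = rng.randrange(len(tokens))
        for cand in rng.sample(VOCAB, n_candidates):
            trial = tokens.copy()
            trial[pos] = cand
            loss = toy_loss(trial)
            if loss < best:  # greedy: accept only improving edits
                tokens, best = trial, loss
    return tokens, best
```

Because every candidate edit returns a scalar loss, an agent iterating on this kind of algorithm gets immediate, quantitative signal on whether a change helped, which is what makes this domain amenable to automated research.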