Capability-Based Scaling Laws for LLM Red-Teaming

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the capability mismatch problem in LLM red-teaming, namely weak attackers facing strong target models, by proposing a jailbreak attack scaling law grounded in the capability gap between attacker and target. The authors design an LLM-driven, multi-strategy jailbreak framework and systematically evaluate more than 500 attacker-target pairs spanning diverse model families, sizes, and capability tiers, with fine-grained capability measured on the social-science splits of MMLU-Pro. Key findings: (1) jailbreak success rate rises with the attacker's capability advantage and drops sharply once the target's capability exceeds the attacker's; (2) this relationship yields a quantifiable critical inversion point in the capability gap beyond which attacks largely fail; and (3) it implies that fixed-capability red-team agents, such as human attackers, are likely to underperform against increasingly capable targets. Together, these results provide a predictive, quantifiable measure of red-teaming efficacy, supporting more rigorous and scalable safety evaluation of LLMs.

📝 Abstract
As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.
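The abstract describes what the scaling law does (predict attack success for a fixed target from the attacker-target capability gap) but not its functional form. As a rough, non-authoritative sketch, the snippet below fits a logistic curve of attack success rate (ASR) against the capability gap with SciPy; the curve shape, the parameter names (k, gap0, asr_max), and the toy data points are illustrative assumptions, not the paper's reported law or measurements.

```python
# Hypothetical sketch of a jailbreaking scaling law fit: ASR as a logistic
# function of the attacker-target capability gap. Functional form and data
# are assumptions for illustration only; the paper's actual fit is in the PDF.
import numpy as np
from scipy.optimize import curve_fit

def asr_scaling_law(gap, k, gap0, asr_max):
    """Logistic curve: ASR rises toward asr_max as the capability gap
    (attacker score minus target score, e.g. on MMLU-Pro social-science
    splits) grows; gap0 marks the transition where success falls off."""
    return asr_max / (1.0 + np.exp(-k * (gap - gap0)))

# Toy observations: (capability gap, measured attack success rate) pairs.
# In the paper these would come from the 500+ attacker-target evaluations.
gaps = np.array([-0.30, -0.20, -0.10, 0.00, 0.10, 0.20, 0.30])
asr  = np.array([ 0.05,  0.08,  0.15, 0.35, 0.60, 0.72, 0.78])

params, _ = curve_fit(asr_scaling_law, gaps, asr, p0=[10.0, 0.0, 0.8])
k, gap0, asr_max = params
print(f"steepness={k:.2f}, inversion point={gap0:.3f}, ceiling ASR={asr_max:.2f}")

# Predict success for an attacker whose capability trails the target by 0.15:
print(f"predicted ASR at gap=-0.15: {asr_scaling_law(-0.15, *params):.2f}")
```

Under a fit like this, gap0 plays the role of the capability inversion point discussed above: predicted ASR collapses once the target's capability exceeds the attacker's by more than the fitted threshold.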
Problem

Research questions and friction points this paper is trying to address.

Studying red-teaming effectiveness as LLMs surpass human capabilities
Analyzing attack success based on attacker-target capability gaps
Predicting jailbreak risks for future models using scaling laws
Innovation

Methods, ideas, or system contributions that make the work stand out.

Capability gap analysis for red-teaming effectiveness
LLM-based jailbreak attacks mimic human red-teamers
Jailbreaking scaling law predicts attack success