🤖 AI Summary
Evaluating the robustness of large language models (LLMs) via conventional red-teaming is computationally prohibitive.
Method: This paper proposes an efficient evaluation framework based on lightweight proxy metrics. It systematically validates three low-overhead proxies—embedding-space gradient attacks, prefilling attacks, and direct prompting—and assesses their predictive power using Pearson (linear) and Spearman (rank) correlation analyses against full red-teaming results.
Contribution/Results: The proposed proxy metrics achieve high agreement with full red-teaming attack success rates (Pearson *r*<sub>p</sub> = 0.87, Spearman *r*<sub>s</sub> = 0.94) while reducing computational cost by roughly three orders of magnitude. This removes the scalability bottleneck of conventional red-teaming, enabling rapid, low-cost, and interpretable robustness diagnostics for safe LLM deployment.
📝 Abstract
Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without running the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct prompting. Even though direct prompting in particular does not achieve a high attack success rate (ASR), we find that it and embedding-space attacks can predict attack success rates well, achieving $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.
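The reported $r_p$ and $r_s$ figures measure how well a cheap proxy score tracks the full attack ensemble's ASR across models. As a minimal pure-Python sketch of that correlation analysis (the per-model scores below are illustrative placeholders, not data from the paper, and the rank function ignores ties):

```python
from statistics import mean

def pearson(x, y):
    """Pearson (linear) correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def ranks(v):
    """Convert values to 1-based ranks (no tie handling in this sketch)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-model scores: proxy metric vs. full-ensemble ASR.
proxy_score = [0.05, 0.12, 0.30, 0.41, 0.55]
ensemble_asr = [0.10, 0.18, 0.35, 0.52, 0.60]

print(f"r_p = {pearson(proxy_score, ensemble_asr):.3f}")
print(f"r_s = {spearman(proxy_score, ensemble_asr):.3f}")
```

In practice one would use `scipy.stats.pearsonr` and `scipy.stats.spearmanr` (which also handle ties and report p-values); the point here is only that a single correlation over per-model scores is all that the proxy validation requires.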