🤖 AI Summary
Evaluating the robustness of large language models (LLMs) via conventional red-teaming is computationally prohibitive.
Method: This paper proposes an efficient evaluation framework based on lightweight proxy metrics. It systematically validates three low-overhead proxies—embedding-space gradient attacks, prefilling attacks, and direct prompting—and assesses their predictive power using Pearson (linear) and Spearman (rank) correlation analyses against full red-teaming results.
Contribution/Results: The proposed proxy metrics achieve high agreement with full red-teaming attack success rates (Pearson *r*<sub>p</sub> = 0.87, Spearman *r*<sub>s</sub> = 0.94) while reducing computational cost by roughly three orders of magnitude. This removes the scalability bottleneck of conventional red-teaming, enabling rapid, low-cost, and interpretable robustness diagnostics for safe LLM deployment.
📝 Abstract
Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without running the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct prompting. Even though direct prompting in particular does not achieve a high attack success rate (ASR), we find that it and embedding-space attacks can predict attack success rates well, achieving $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.
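The reported $r_p$ and $r_s$ figures measure how well a cheap proxy score tracks the full attack ensemble's ASR across models. As a minimal pure-Python sketch of that correlation analysis (the per-model scores below are illustrative placeholders, not data from the paper, and the rank function ignores ties):

```python
from statistics import mean

def pearson(x, y):
    """Pearson (linear) correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def ranks(v):
    """Convert values to 1-based ranks (no tie handling in this sketch)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-model scores: proxy metric vs. full-ensemble ASR.
proxy_score = [0.05, 0.12, 0.30, 0.41, 0.55]
ensemble_asr = [0.10, 0.18, 0.35, 0.52, 0.60]

print(f"r_p = {pearson(proxy_score, ensemble_asr):.3f}")
print(f"r_s = {spearman(proxy_score, ensemble_asr):.3f}")
```

In practice one would use `scipy.stats.pearsonr` and `scipy.stats.spearmanr` (which also handle ties and report p-values); the point here is only that a single correlation over per-model scores is all that the proxy validation requires.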