Selective Adversarial Attacks on LLM Benchmarks

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work reveals that LLM benchmarks (e.g., MMLU) are vulnerable to *selective adversarial attacks* that preserve semantic meaning: minimal input perturbations can significantly degrade or boost the performance of specific models while leaving others nearly unaffected—thereby distorting leaderboard rankings and undermining fairness, reproducibility, and transparency in evaluation. To address this, we formally define and implement the first selective adversarial attack framework for LLM benchmarks. Our contributions are threefold: (1) a quantifiable selectivity evaluation protocol; (2) a dual-constraint optimization mechanism balancing semantic consistency and model response divergence; and (3) an efficient perturbation generation pipeline integrating proxy LLMs and the TextAttack framework. Experiments demonstrate reliable rank reversal across models, exposing the high sensitivity of current benchmarks to fine-grained edits. This underscores the urgent need for perturbation-aware reporting and robustness diagnostics as a new paradigm in LLM evaluation.

📝 Abstract
Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically equivalent adversarial perturbations. Prior work on adversarial robustness in NLP has emphasized text attacks that affect many models equally, leaving open the question of whether it is possible to selectively degrade or enhance one model's performance while minimally affecting other models. We formalize this problem and study selective adversarial attacks on MMLU, a widely used benchmark designed to measure a language model's broad general knowledge and reasoning ability across different subjects. Using canonical attacks integrated into the TextAttack framework, we introduce a protocol for selectivity assessment, develop a custom constraint to increase the selectivity of attacks, and propose a surrogate-LLM pipeline that generates selective perturbations. Empirically, we find that selective adversarial attacks exist and can materially alter relative rankings, challenging the fairness, reproducibility, and transparency of leaderboard-driven evaluation. Our results motivate perturbation-aware reporting and robustness diagnostics for LLM evaluation and demonstrate that even subtle edits can shift comparative judgments.
Problem

Research questions and friction points this paper is trying to address.

Selectively degrade or enhance LLM performance via adversarial attacks
Challenge fairness and reproducibility of benchmark-driven evaluations
Generate subtle perturbations that alter relative model rankings
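The notion of "selectively degrading one model while altering rankings" can be made concrete with a simple scoring function. The metric and names below are illustrative assumptions, not the paper's exact protocol: a perturbation counts as selective when it shifts the target model's benchmark accuracy far more than any other model's, and a rank reversal occurs when the sign of the accuracy gap between two models flips.

```python
# Sketch of a selectivity score for a benchmark perturbation.
# Assumed formalization (not the paper's exact protocol): selectivity is
# the accuracy drop induced on the target model minus the largest
# absolute accuracy change induced on any non-target model.

def selectivity_score(target, clean_acc, perturbed_acc):
    """clean_acc / perturbed_acc: dicts mapping model name -> accuracy."""
    target_drop = clean_acc[target] - perturbed_acc[target]
    collateral = max(
        abs(clean_acc[m] - perturbed_acc[m])
        for m in clean_acc if m != target
    )
    return target_drop - collateral

def rank_reversed(a, b, clean_acc, perturbed_acc):
    """True if the perturbation flips the relative ranking of models a and b."""
    before = clean_acc[a] - clean_acc[b]
    after = perturbed_acc[a] - perturbed_acc[b]
    return before * after < 0

# Toy numbers: the attack hits model_a hard and barely touches the others.
clean = {"model_a": 0.71, "model_b": 0.68, "model_c": 0.65}
attacked = {"model_a": 0.61, "model_b": 0.67, "model_c": 0.65}

print(selectivity_score("model_a", clean, attacked))          # ~0.09
print(rank_reversed("model_a", "model_b", clean, attacked))   # True: a falls below b
```

A high score means the edit is both damaging and targeted; a perturbation that degrades every model equally scores near zero even if the absolute drops are large.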
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective adversarial attacks on benchmark evaluations
Custom constraint increases attack selectivity
Surrogate-LLM pipeline generates selective perturbations
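The dual constraint described above (semantic consistency plus model response divergence) can be sketched as an acceptance test that each candidate perturbation must pass. Everything here is a hypothetical illustration: the token-overlap similarity, the threshold, and the rule-based "models" stand in for whatever the paper's pipeline actually uses (e.g. embedding similarity and TextAttack constraints over real LLMs).

```python
# Hypothetical acceptance check for a selective perturbation.
# A candidate edit is kept only if it (1) stays semantically close to the
# original question and (2) flips the target model's answer while leaving
# every non-target model's answer unchanged.

def accept_perturbation(original, perturbed, target, others,
                        similarity, min_sim=0.9):
    """
    target / others: callables mapping a question string to an answer label.
    similarity: callable scoring semantic closeness in [0, 1] (assumed).
    """
    # Constraint 1: semantic consistency with the original item.
    if similarity(original, perturbed) < min_sim:
        return False
    # Constraint 2a: response divergence on the target model.
    if target(perturbed) == target(original):
        return False
    # Constraint 2b: stability of all non-target models.
    return all(m(perturbed) == m(original) for m in others)

# Toy stand-ins: a token-overlap "similarity" and rule-based "models".
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def target_model(q):
    return "B" if "approximately" in q else "A"

def other_model(q):
    return "A"

q = "What is the boiling point of water at sea level?"
q_adv = "What is approximately the boiling point of water at sea level?"

print(accept_perturbation(q, q_adv, target_model, [other_model],
                          jaccard, min_sim=0.8))  # True
```

In a real pipeline the candidate edits would come from a search over word- or character-level transformations, with a surrogate LLM scoring divergence before the expensive victim models are queried.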
Ivan Dubrovsky, ITMO University, Saint Petersburg
Anastasia Orlova, ITMO University, Saint Petersburg
Illarion Iov, ITMO University, Saint Petersburg
Nina Gubina, ITMO University (computer-aided drug design, applied artificial intelligence, cheminformatics)
Irena Gureeva, Applied AI Institute, Moscow
Alexey Zaytsev, Associate Professor at BIMSA (deep learning, machine learning, statistics)