🤖 AI Summary
This work reveals that LLM benchmarks (e.g., MMLU) are vulnerable to *selective adversarial attacks* that preserve semantic meaning: minimal input perturbations can significantly degrade or boost the performance of specific models while leaving others nearly unaffected, thereby distorting leaderboard rankings and undermining fairness, reproducibility, and transparency in evaluation. To address this, we formally define and implement the first selective adversarial attack framework for LLM benchmarks. Our contributions are threefold: (1) a quantifiable selectivity evaluation protocol; (2) a dual-constraint optimization mechanism balancing semantic consistency and model response divergence; and (3) an efficient perturbation generation pipeline integrating surrogate LLMs and the TextAttack framework. Experiments demonstrate reliable rank reversal across models, exposing the high sensitivity of current benchmarks to fine-grained edits. This underscores the urgent need for perturbation-aware reporting and robustness diagnostics as a new paradigm in LLM evaluation.
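One plausible reading of the selectivity evaluation protocol is to score a perturbation by how much it moves the target model relative to the collateral movement it causes in every other model. The sketch below illustrates that idea; the function name, the exact formula, and the numbers are hypothetical stand-ins, not taken from the paper.

```python
# Hypothetical selectivity score: the target model's accuracy drop minus the
# mean absolute accuracy change across all non-target models. Higher values
# mean the perturbation is more selective. Illustrative only.

def selectivity_score(clean_acc: dict, perturbed_acc: dict, target: str) -> float:
    """clean_acc / perturbed_acc map model name -> benchmark accuracy."""
    target_drop = clean_acc[target] - perturbed_acc[target]
    others = [m for m in clean_acc if m != target]
    collateral = sum(abs(clean_acc[m] - perturbed_acc[m]) for m in others) / len(others)
    return target_drop - collateral

# Toy numbers: the attack hits model_a hard and barely touches the rest.
clean = {"model_a": 0.70, "model_b": 0.68, "model_c": 0.72}
attacked = {"model_a": 0.55, "model_b": 0.67, "model_c": 0.72}
print(selectivity_score(clean, attacked, "model_a"))
```

A score near the target's own drop indicates a clean, selective hit; a score near zero means the perturbation degraded all models roughly equally, which is the non-selective regime prior work focused on.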
📝 Abstract
Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically equivalent adversarial perturbations. Prior work on adversarial robustness in NLP has emphasized text attacks that affect many models equally, leaving open the question of whether it is possible to selectively degrade or enhance the performance of one model while minimally affecting others. We formalize this problem and study selective adversarial attacks on MMLU, a widely used benchmark designed to measure a language model's broad general knowledge and reasoning ability across diverse subjects. Using canonical attacks integrated into the TextAttack framework, we introduce a protocol for selectivity assessment, develop a custom constraint to increase the selectivity of attacks, and propose a surrogate-LLM pipeline that generates selective perturbations. Empirically, we find that selective adversarial attacks exist and can materially alter relative rankings, challenging the fairness, reproducibility, and transparency of leaderboard-driven evaluation. Our results motivate perturbation-aware reporting and robustness diagnostics for LLM evaluation and demonstrate that even subtle edits can shift comparative judgments.
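The dual constraint described above can be read as an accept/reject rule over candidate perturbations: a candidate is kept only if it stays semantically close to the original question, induces a large response divergence in the target model, and leaves every other model's response (nearly) unchanged. Below is a minimal sketch with toy stand-in scoring functions; all names, thresholds, and example strings are hypothetical, not the paper's implementation.

```python
# Illustrative accept/reject rule combining the two constraints from the
# abstract: semantic consistency plus selective response divergence.
# Thresholds and the scoring callables are assumptions, not paper values.

def is_selective(cand, orig, similarity, divergence, target, others,
                 sim_min=0.8, target_min=0.5, other_max=0.1):
    if similarity(orig, cand) < sim_min:                      # meaning preserved?
        return False
    if divergence(target, orig, cand) < target_min:           # target must move
        return False
    # every non-target model must remain (nearly) unaffected
    return all(divergence(m, orig, cand) <= other_max for m in others)

# Toy similarity: Jaccard overlap of whitespace tokens.
def toy_similarity(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

# Toy divergence: a fixed lookup standing in for per-model response shift.
orig = "Paris is the capital of France"
cand = "Paris is the capital city of France"
shifts = {("target", orig, cand): 0.8, ("other1", orig, cand): 0.05}
def toy_divergence(model, o, c):
    return shifts.get((model, o, c), 0.0)

print(is_selective(cand, orig, toy_similarity, toy_divergence,
                   "target", ["other1"]))
```

In a real pipeline the similarity callable would be an embedding-based semantic metric and the divergence callable would query the surrogate models; wiring this rule in as a custom TextAttack constraint would filter candidate transformations during search, which matches the role the abstract assigns to the custom constraint.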