Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness

📅 2024-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing adversarial datasets lack a standardized robustness metric, making it difficult to assess whether they remain challenging for modern models. Method: We propose AdvScore, the first human-grounded, computable metric for quantifying adversarialness, defined as the degree to which questions are easy for humans but hard for models, and use it to build AdvQA, a high-quality adversarial question-answering dataset. Our approach integrates 9,347 human responses with predictions from ten language models, modeling the gap between human and model capabilities and detecting outlier examples, yielding a dynamic evaluation framework that tracks robustness across model generations and diagnoses whether a dataset is still adversarial. Results: Experiments show consistent improvement in mainstream models' adversarial QA performance from 2020 to 2024; several popular benchmarks have degraded significantly as adversarial tests; and AdvQA achieves superior human-model performance separation, demonstrating stronger discriminative power for evaluating model robustness.

📝 Abstract
Adversarial datasets should validate AI robustness by providing samples on which humans perform well, but models do not. However, as models evolve, datasets can become obsolete. Measuring whether a dataset remains adversarial is hindered by the lack of a standardized metric for measuring adversarialness. We propose AdvScore, a human-grounded evaluation metric that assesses a dataset's adversarialness by capturing models' and humans' varying abilities while also identifying poor examples. We then use AdvScore to motivate a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore using 9,347 human responses and ten language models' predictions to track model improvement over five years, from 2020 to 2024. AdvScore thus provides guidance for achieving robustness comparable with human capabilities. Furthermore, it helps determine to what extent adversarial datasets continue to pose challenges, ensuring that, rather than reflecting outdated or overly artificial difficulties, they effectively test model capabilities.
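The abstract defines adversarialness as the gap between human and model performance on the same questions. The page gives no formula, and the paper's actual AdvScore additionally models varying abilities and flags poor examples; as a heavily simplified, hypothetical sketch of the core idea, one could average the per-question difference between human and model accuracy. All names below are illustrative, not from the paper.

```python
import statistics

def simple_adv_score(human_correct, model_correct):
    """Hypothetical simplification of the adversarialness idea:
    average, over questions, of (human accuracy - model accuracy).

    human_correct / model_correct: dicts mapping a question id to a
    list of 0/1 outcomes (1 = answered correctly). Higher scores mean
    the questions are easier for humans than for models, i.e. the
    dataset is more adversarial; scores near zero or negative suggest
    models have caught up and the dataset may be obsolete.
    """
    gaps = []
    for qid in human_correct:
        human_acc = statistics.mean(human_correct[qid])
        model_acc = statistics.mean(model_correct[qid])
        gaps.append(human_acc - model_acc)
    return statistics.mean(gaps)

# Toy example: two questions, three human and two model attempts each.
humans = {"q1": [1, 1, 1], "q2": [1, 0, 1]}
models = {"q1": [0, 0], "q2": [1, 0]}
print(round(simple_adv_score(humans, models), 3))  # → 0.583
```

Note that this sketch ignores what the paper explicitly adds: per-subject skill and per-item difficulty vary, and low-quality items must be identified and discounted, which a raw accuracy gap cannot do.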
Problem

Research questions and friction points this paper is trying to address.

Developing a metric for dataset adversarialness
Creating realistic adversarial QA datasets
Tracking AI model robustness over time
Innovation

Methods, ideas, or system contributions that make the work stand out.

AdvScore metric evaluates adversarialness
Pipeline creates realistic adversarial datasets
AdvScore tracks model improvement over time