🤖 AI Summary
Problem: Existing studies of Transformer-based question answering (QA) models lack fine-grained adversarial noise benchmarks and unified quantitative metrics for evaluating robustness to context perturbations. Method: We construct the first comprehensive adversarial context perturbation benchmark, built upon SQuAD, with seven distinct noise types and five intensity levels, and propose a standardized robustness measurement framework that enables comparisons across noise types and intensities. Contribution/Results: Empirical evaluation reveals that mainstream QA models (e.g., BERT, RoBERTa) are highly sensitive to context perturbations, that their performance degrades in a pronounced nonlinear pattern, and that different noise types have markedly different impacts. This work provides a reproducible benchmark, comparable evaluation metrics, and empirical evidence to support the design and assessment of robust QA systems.
📝 Abstract
Contextual question-answering models are susceptible to adversarial perturbations of the input context, which are commonly observed in real-world scenarios. Such adversarial noise is designed to degrade model performance by distorting the textual input. We introduce a unique dataset that applies seven distinct types of adversarial noise to the context passages of the SQuAD dataset, each at five different intensity levels. To quantify robustness, we use robustness metrics that provide a standardized measure of model performance across noise types and levels. Experiments on transformer-based question-answering models reveal robustness vulnerabilities and yield important insights into model performance on realistic textual input.
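
To make the setup concrete, here is a minimal sketch of how one noise type could be applied to a context passage at several intensity levels, and how a normalized robustness score could be computed. The abstract names neither the seven noise types nor the exact metric, so everything below is an assumption made for illustration: the adjacent-character swap, the five intensity values, the function names, and the score ratio are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical intensity levels: the fraction of eligible words to perturb.
# The paper uses five levels, but their actual values are not stated here.
INTENSITY_LEVELS = (0.1, 0.2, 0.3, 0.4, 0.5)


def swap_adjacent_chars(context: str, intensity: float, seed: int = 0) -> str:
    """Illustrative noise type (an assumption): swap two adjacent inner
    characters in roughly `intensity` of the words longer than three
    characters. The first and last characters of each word are kept."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    words = context.split(" ")
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < intensity:
            j = rng.randrange(1, len(word) - 2)  # inner position to swap
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)


def retained_performance(clean_score: float, noisy_score: float) -> float:
    """Illustrative robustness score (an assumption): the share of the
    clean-context score (e.g. F1 or exact match) kept under noise.
    1.0 means no degradation; lower values mean less robustness."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return noisy_score / clean_score


if __name__ == "__main__":
    passage = "The Normans were the people who gave their name to Normandy"
    for level in INTENSITY_LEVELS:
        print(level, swap_adjacent_chars(passage, level, seed=42))
```

Because the score is a ratio against clean performance, it can be compared across noise types and intensity levels, which is the property the abstract asks of a standardized measure. A real benchmark would also need to decide how to handle perturbations that alter the gold answer span inside the context; this sketch does not address that.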