🤖 AI Summary
This study investigates the robustness of NLP models under systematic, minimal linguistic perturbations spanning orthography, syntax, dialect, and style. The authors propose a linguistically driven, task-agnostic framework for controllable perturbation generation, combining LLM-based prompt engineering with human-in-the-loop validation to construct high-quality, multi-granularity perturbed datasets. A cross-task benchmarking protocol is designed and evaluated on four mainstream NLP tasks. Key findings: (1) negation-based modifications induce widespread vulnerability, revealing a weakness shared across models; (2) LLMs are more robust overall than fine-tuned models but still exhibit significant language-level deficiencies; (3) perturbation effects are strongly task-dependent. The work establishes a systematic robustness evaluation paradigm and provides practical tools for measuring linguistic robustness.
📝 Abstract
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels - from orthography to dialect and style varieties - and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across four diverse NLP tasks, and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) while LLMs have better overall robustness compared to fine-tuned models, they still exhibit significant brittleness to certain linguistic variations; (3) all models show substantial vulnerability to negation modifications across most tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.
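To make the idea of "minimal variations" concrete, here is an illustrative sketch (not the authors' implementation, which uses LLM generation with human validation) of two FLUKE-style perturbations: an orthographic typo and a negation insertion. Both function names and the rule-based logic are hypothetical simplifications; they only demonstrate that each perturbation changes one linguistic dimension while leaving the rest of the sentence intact.

```python
# Hypothetical sketch of FLUKE-style minimal perturbations.
# Real FLUKE prompts an LLM and validates outputs with humans;
# these rule-based stand-ins just illustrate the concept.

def orthographic_perturbation(sentence: str, index: int = 0) -> str:
    """Swap two adjacent characters in one word (typo-style variation)."""
    words = sentence.split()
    w = words[index]
    if len(w) > 2:
        mid = len(w) // 2
        # swap the characters at positions mid-1 and mid
        w = w[:mid - 1] + w[mid] + w[mid - 1] + w[mid + 1:]
    words[index] = w
    return " ".join(words)

def negation_perturbation(sentence: str) -> str:
    """Insert 'not' after the first auxiliary/copula found."""
    auxiliaries = {"is", "was", "are", "were", "can", "will", "does", "do"}
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower() in auxiliaries:
            return " ".join(words[: i + 1] + ["not"] + words[i + 1:])
    return sentence  # no trivial insertion point found

original = "The service is excellent"
print(orthographic_perturbation(original))  # -> "hTe service is excellent"
print(negation_perturbation(original))      # -> "The service is not excellent"
```

A sentiment model should be invariant to the typo but must flip its prediction under the negation, which is exactly the contrast FLUKE's task-dependent analysis measures.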