FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation

📅 2025-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the robustness of NLP models under systematic, minimal linguistic perturbations spanning orthography, syntax, dialect, and style. We propose FLUKE, a linguistically-driven, task-agnostic framework for controlled perturbation generation that combines LLM-based prompting with human-in-the-loop validation to construct high-quality, multi-granularity perturbed test sets. A cross-task benchmarking protocol is designed and evaluated across four mainstream NLP tasks, on both fine-tuned models and LLMs. Key findings: (1) negation-based modifications induce widespread vulnerability, revealing a weakness shared across models; (2) LLMs are more robust overall than fine-tuned models but still exhibit significant brittleness to certain linguistic variations; (3) perturbation effects are strongly task-dependent. The work establishes a systematic robustness evaluation paradigm and provides practical tools for probing linguistic robustness.

📝 Abstract
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels - from orthography to dialect and style varieties - and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across four diverse NLP tasks, and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) while LLMs have better overall robustness compared to fine-tuned models, they still exhibit significant brittleness to certain linguistic variations; (3) all models show substantial vulnerability to negation modifications across most tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.
Problem

Research questions and friction points this paper addresses.

Evaluating model robustness via systematic test data variations
Assessing task-dependent impact of linguistic variations on models
Identifying model vulnerabilities to specific linguistic modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-agnostic framework for robustness evaluation
Controlled linguistic variations spanning multiple levels, from orthography to dialect and style
LLM-generated modifications with human validation
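The evaluation recipe above (perturb test inputs minimally, then measure the accuracy drop relative to the clean data) can be sketched in a few lines. This is an illustrative toy, not FLUKE's actual implementation: the rule-based character swap stands in for the paper's LLM-generated, human-validated modifications, and `robustness_drop` and `perturb_orthography` are hypothetical helper names.

```python
import random


def perturb_orthography(text: str, seed: int = 0) -> str:
    """Toy orthographic perturbation: swap two adjacent characters in one
    word. A stand-in for FLUKE's LLM-generated minimal edits."""
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def robustness_drop(model, dataset, perturb) -> float:
    """Accuracy on clean inputs minus accuracy on perturbed inputs.

    model:   callable text -> predicted label
    dataset: list of (text, gold_label) pairs
    perturb: callable text -> perturbed text
    """
    clean = sum(model(x) == y for x, y in dataset) / len(dataset)
    pert = sum(model(perturb(x)) == y for x, y in dataset) / len(dataset)
    return clean - pert
```

In the paper's setting, `model` would be a fine-tuned classifier or a prompted LLM, and `perturb` one of the linguistic-level tests (e.g. negation insertion), with the drop compared across tasks to expose task-dependent vulnerabilities.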