🤖 AI Summary
Large language models (LLMs) are highly sensitive to minor lexical or syntactic variations in prompts, yet existing evaluation methods often rely on hand-crafted or unnatural perturbations that fail to reflect robustness under authentic language use. Method: We propose the first linguistics-driven framework of classified minimal transformations for prompt rewriting, characterized by fine-grained, controllable, and interpretable changes grounded in user context. The approach combines an adaptation of the BBQ benchmark, dual verification through human annotation and automated consistency checking, and quantitative stability analysis. Contribution/Results: Experiments show that natural paraphrasing induces accuracy fluctuations exceeding 20%, exposing a widespread lack of paraphrase robustness in current LLM evaluations. This work establishes a foundational paradigm for paraphrase-aware LLM assessment, moving evaluation standards toward linguistic realism and contextual fidelity.
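
To illustrate the kind of labeled, minimal transformation the summary refers to, here is a small sketch; the category names and rewrite rules are hypothetical examples for intuition only and are not the paper's actual taxonomy:

```python
"""Illustrative sketch of rule-based minimal prompt transformations.

Category names and rules are hypothetical, not the paper's taxonomy.
"""
import re

# Each hypothetical category maps to a (pattern, replacement) surface rewrite rule.
TRANSFORMATIONS = {
    "contraction": (r"\bdid not\b", "didn't"),             # lexical: full form -> contraction
    "synonym_swap": (r"\bfinish\b", "complete"),            # lexical: near-synonym substitution
    "politeness": (r"^Who\b", "Could you tell me who"),     # pragmatic: added hedge
}

def apply_transformation(prompt: str, category: str) -> str:
    """Apply one labeled minimal transformation; return the prompt unchanged if no match."""
    pattern, replacement = TRANSFORMATIONS[category]
    return re.sub(pattern, replacement, prompt)

original = "Who did not finish the report?"
for cat in TRANSFORMATIONS:
    print(cat, "->", apply_transformation(original, cat))
```

Each variant changes only one surface property of the prompt, which is what makes the resulting accuracy shifts attributable to that specific transformation.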
📝 Abstract
Small changes in how a prompt is worded can lead to meaningful differences in the behavior of large language models (LLMs), raising concerns about the stability and reliability of their evaluations. While prior work has explored simple formatting changes, these rarely capture the kinds of natural variation seen in real-world language use. We propose a controlled paraphrasing framework based on a taxonomy of minimal linguistic transformations to systematically generate natural prompt variations. Using the BBQ dataset, we validate our method with both human annotations and automated checks, then use it to study how LLMs respond to paraphrased prompts in stereotype evaluation tasks. Our analysis shows that even subtle prompt modifications can lead to substantial changes in model behavior. These results highlight the need for robust, paraphrase-aware evaluation protocols.
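
A minimal sketch of how paraphrase sensitivity could be quantified in such a setup, assuming a set of BBQ-style items, each answered under several validated paraphrase variants; the function names and data layout are assumptions, not the paper's code:

```python
"""Sketch of a paraphrase-stability analysis; data layout and names are assumed."""
from statistics import mean

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions matching the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def paraphrase_fluctuation(results: dict[str, list[str]], gold: list[str]) -> dict[str, float]:
    """Compare accuracy across paraphrase variants of the same ordered item set.

    `results` maps a variant id (e.g. "original", "contraction") to the model's
    answers on the same items in the same order.
    """
    per_variant = {variant: accuracy(preds, gold) for variant, preds in results.items()}
    accs = list(per_variant.values())
    return {
        "mean_accuracy": mean(accs),
        "fluctuation_range": max(accs) - min(accs),  # spread induced by paraphrasing alone
        **{f"acc[{v}]": a for v, a in per_variant.items()},
    }

# Toy example: three items with gold answers and two paraphrase variants.
gold = ["A", "B", "C"]
results = {"original": ["A", "B", "C"], "contraction": ["A", "C", "C"]}
print(paraphrase_fluctuation(results, gold))
```

The fluctuation range here plays the role of the paper's stability measure: a large gap between the best- and worst-performing variants of the same items signals a lack of paraphrase robustness.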