🤖 AI Summary
Current reference-free dialogue evaluation metrics lack rigorous robustness validation. Method: We construct an adversarial perturbation benchmark to systematically assess the stability of prominent metrics, including DialogRPT, UniEval, and PromptEval, under four attack types: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. Contribution/Results: Experiments reveal that while these metrics achieve comparable correlation with human judgments on standard benchmarks, their behavior diverges markedly on adversarial examples; crucially, correlation strength does not predict robustness. We therefore propose a decoupled evaluation paradigm that distinguishes *effectiveness* (correlation with human judgment) from *robustness* (resilience to adversarial perturbations), and empirically demonstrate both the necessity and the feasibility of robustness assessment. This work provides practical tools and a methodological foundation for trustworthy dialogue evaluation.
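As a rough illustration of the decoupled paradigm (a minimal sketch, not the paper's implementation: `metric_fn`, the dialogue fields, and `perturb_fn` are hypothetical placeholders), effectiveness can be measured as rank correlation with human ratings, and robustness as the score drop a metric assigns to adversarially perturbed responses:

```python
# Sketch of the effectiveness / robustness split described above.
from scipy.stats import spearmanr


def effectiveness(metric_fn, dialogues, human_scores):
    """Correlation of metric scores with human judgments (higher is better)."""
    metric_scores = [metric_fn(d["context"], d["response"]) for d in dialogues]
    corr, _ = spearmanr(metric_scores, human_scores)
    return corr


def robustness(metric_fn, dialogues, perturb_fn):
    """Mean score drop when responses are adversarially perturbed.

    A robust metric should score degenerate responses clearly lower,
    so a larger positive drop is better.
    """
    drops = []
    for d in dialogues:
        clean = metric_fn(d["context"], d["response"])
        attacked = metric_fn(d["context"], perturb_fn(d["context"], d["response"]))
        drops.append(clean - attacked)
    return sum(drops) / len(drops)
```

Under this framing, two metrics with the same effectiveness score can still differ sharply in robustness, which is exactly the divergence the benchmark is designed to expose.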
📝 Abstract
Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval -- a prompt-based method leveraging LLMs -- across grounded and ungrounded datasets. By examining both their correlation with human judgment and their susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear equivalent under traditional benchmarks can differ substantially in how they score adversarial responses. These findings motivate the development of nuanced evaluation frameworks to address real-world dialogue challenges.
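The four attack categories can be pictured as simple transformations of a (context, response) pair. The sketch below is only illustrative; the specific tags, templates, and corruption strategies are assumptions, not the benchmark's actual generation procedure:

```python
# Illustrative sketches of the four attack categories named in the abstract.
# `context` is assumed to be a list of prior turns (strings).
import random


def speaker_tag_attack(context, response):
    """Prefix the response with a speaker tag that a metric might reward."""
    return "User: " + response


def static_response_attack(context, response):
    """Replace the response with a generic, context-independent reply."""
    return "I don't know, but that sounds interesting."


def ungrammatical_attack(context, response):
    """Corrupt grammaticality, e.g. by shuffling the word order."""
    words = response.split()
    random.shuffle(words)
    return " ".join(words)


def repeated_context_attack(context, response):
    """Echo the last conversational turn back instead of responding."""
    return context[-1]
```

A metric that is sensitive to response quality should penalize all four transformations; the benchmark measures how strongly each metric actually does so.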