Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?

📅 2025-02-07

🤖 AI Summary

Text style transfer (TST) lacks reliable, automated evaluation metrics—especially in multilingual and cross-task settings. To address this, we conduct the first systematic meta-evaluation of TST, covering sentiment transfer and detoxification tasks across English, Hindi, and Bengali. We propose a hybrid metric framework integrating BERTScore, BLEURT, MAUVE, and LLM-based assessment, augmented with an ensemble strategy. Experimental results demonstrate that general-purpose NLP metrics consistently outperform traditional TST-specific metrics; our hybrid approach improves average Spearman correlation with human judgments by 23%, significantly enhancing consistency, accuracy, and reproducibility. This work establishes the first empirically validated, standardized evaluation paradigm for multilingual TST.

📝 Abstract

Text Style Transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge that requires assessing style transfer accuracy, content preservation, and naturalness. As in other natural language processing (NLP) tasks, human evaluation is ideal but costly; however, automatic metrics for TST have received less attention than metrics for, e.g., machine translation or summarization. In this paper, we examine a set of existing metrics alongside novel metrics drawn from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of Large Language Models (LLMs) as tools for TST evaluation. Our findings highlight that certain advanced NLP metrics and experimental hybrid techniques provide better insights than existing TST metrics, delivering more accurate, consistent, and reproducible TST evaluations.
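The meta-evaluation described in the abstract scores each automatic metric by how well it correlates with human judgments. The sketch below illustrates that idea with Spearman rank correlation, the coefficient named in the AI summary. It is a minimal illustration, not the authors' code: the function names (`average_ranks`, `pearson`, `spearman`) and the toy scores are assumptions made for this example.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings and one metric's scores for five outputs.
human = [1, 2, 3, 4, 5]
metric_a = [0.10, 0.35, 0.30, 0.80, 0.90]
rho = spearman(human, metric_a)
```

A metric whose `rho` is closer to 1 orders the outputs more like the human annotators do. In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written version.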

Problem

Research questions and friction points this paper is trying to address.

Assessing metrics for Text Style Transfer evaluation.

Evaluating style transfer accuracy, content preservation, and naturalness.

Exploring multilingual metrics and LLMs' evaluation potential.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using advanced NLP metrics

Applying experimental hybrid techniques

Leveraging Large Language Models
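The summary does not specify how the hybrid or ensemble techniques combine metrics. One common approach, shown here purely as an assumption, is to standardize each metric's scores and average them per output. The names `z_normalize` and `ensemble_scores` and the example values are hypothetical.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Rescale scores to zero mean and unit variance (all zeros if constant)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def ensemble_scores(metric_scores):
    """Average z-normalized scores across metrics, one value per output.

    metric_scores maps a metric name to its list of per-output scores.
    """
    normalized = [z_normalize(v) for v in metric_scores.values()]
    if len({len(v) for v in normalized}) != 1:
        raise ValueError("all metrics must score the same number of outputs")
    return [mean(column) for column in zip(*normalized)]


# Hypothetical scores from two metrics on different scales for three outputs.
combined = ensemble_scores({
    "metric_a": [1.0, 2.0, 3.0],
    "metric_b": [10.0, 20.0, 30.0],
})
```

Standardizing first keeps a metric with a larger numeric range from dominating the average. The combined scores can then be compared against human judgments in the same way as any single metric.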
