Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic evaluation of text style transfer—particularly detoxification—exhibits significant discrepancies with human judgments, and existing work is predominantly English-centric, lacking systematic multilingual investigation. Method: We introduce the first multilingual text detoxification benchmark covering nine languages—English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic—and conduct the first comprehensive cross-lingual evaluation of detoxification. Drawing inspiration from machine translation evaluation paradigms, we comparatively analyze neural evaluation models and prompt-based LLM-as-a-judge classifiers. Contribution/Results: We propose design principles for reliable multilingual style transfer evaluation frameworks and empirically validate the cross-lingual robustness of diverse automatic metrics. Our study provides both theoretical foundations and practical guidelines for building trustworthy multilingual style transfer assessment systems.

📝 Abstract
Despite recent progress in large language models (LLMs), evaluation of text generation tasks such as text style transfer (TST) remains a significant challenge. Recent studies (Dementieva et al., 2024; Pauli et al., 2025) revealed a substantial gap between automatic metrics and human judgments. Moreover, most prior work focuses exclusively on English, leaving multilingual TST evaluation largely unexplored. In this paper, we perform the first comprehensive multilingual study on evaluation of text detoxification systems across nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. Drawing inspiration from machine translation, we assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing a more reliable multilingual TST evaluation pipeline for the text detoxification case.
Problem

Research questions and friction points this paper is trying to address.

Evaluating text style transfer lacks reliable multilingual benchmarks.
Automatic metrics poorly align with human judgments in TST.
Current TST evaluation focuses narrowly on English, ignoring other languages.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual study across nine languages
Neural-based and LLM-as-a-judge evaluation
Practical recipe for reliable TST evaluation
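The LLM-as-a-judge evaluation mentioned above can be sketched as a prompt-plus-parse loop. The snippet below is a hypothetical illustration, not the authors' actual prompt or rubric: it assumes a judge scoring the three standard TST axes (style transfer accuracy, content preservation, fluency) as binary labels, with the prompt wording and the `build_judge_prompt`/`parse_judge_reply` helpers invented here for clarity.

```python
import re


def build_judge_prompt(source: str, detoxified: str, language: str) -> str:
    """Compose a judging prompt asking an LLM to rate a detoxified
    rewrite on the three standard TST axes: style (non-toxicity),
    content preservation, and fluency. Rubric wording is illustrative."""
    return (
        f"You are evaluating a text detoxification system for {language}.\n"
        f"Toxic source: {source}\n"
        f"Detoxified rewrite: {detoxified}\n"
        "Rate the rewrite on three axes, each as 0 or 1:\n"
        "STYLE (is the rewrite non-toxic?), "
        "CONTENT (is the original meaning preserved?), "
        "FLUENCY (is the rewrite grammatical?).\n"
        "Answer in the exact form: STYLE=<0|1> CONTENT=<0|1> FLUENCY=<0|1>"
    )


def parse_judge_reply(reply: str) -> dict:
    """Extract the three binary scores from the judge's reply text."""
    scores = dict(re.findall(r"(STYLE|CONTENT|FLUENCY)=([01])", reply))
    return {axis: int(value) for axis, value in scores.items()}
```

A constrained answer format like `STYLE=<0|1> ...` keeps parsing deterministic across languages, which matters when the same judge prompt is reused for all nine languages in the benchmark.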