🤖 AI Summary
Existing content preservation evaluation methods for text style/attribute transfer, which rely on lexical or semantic similarity metrics or on current LLM-based evaluators, fail to model stylistic conditionality, resulting in low correlation with human judgments. Method: The authors propose, for the first time, that content preservation assessment must be *conditioned on the style transfer*, and introduce a zero-shot evaluation method based on next-token conditional likelihood. They also construct a human-annotated benchmark designed specifically for meta-evaluation alignment and use it to systematically validate the necessity of conditional modeling across multiple style transfer tasks. Contribution/Results: Experiments show that the proposed method significantly outperforms baseline approaches, improving correlation with human judgments by 23% on average. This work establishes conditional modeling as essential for accurate, human-aligned content preservation evaluation in style transfer.
📝 Abstract
LLMs make it easy to rewrite text in any style, be it more polite, more persuasive, or more positive. We present a large-scale study of evaluation metrics for style and attribute transfer with a focus on content preservation, i.e., that content not attributable to the style shift is preserved. The de facto evaluation approach uses lexical or semantic similarity metrics, often computed between source sentences and rewrites. While these metrics are not designed to distinguish style differences from content differences, empirical meta-evaluation shows a reasonable correlation with human judgment. In fact, recent works find that LLMs prompted as evaluators perform only comparably to semantic similarity metrics, even though, intuitively, the LLM approach should better fit the task. To investigate this discrepancy, we benchmark 8 metrics for evaluating content preservation on existing datasets and additionally construct a new test set that better aligns with the meta-evaluation aim. On this test set, we find that the empirical conclusion matches the intuition: content preservation metrics for style/attribute transfer must be conditioned on the style shift. To support this, we propose a new, efficient zero-shot evaluation method using the likelihood of the next token. We hope our meta-evaluation fosters more research on evaluating content preservation metrics and helps ensure fair evaluation of methods for conducting style transfer.
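The next-token likelihood idea can be sketched as follows. This is a hypothetical illustration, not the paper's exact method: the prompt wording, evaluator model, and answer tokens here are assumptions. The evaluator LM would be shown the source, the rewrite, and the target style, then asked a yes/no question about content preservation; the probability mass it assigns to "Yes" as the first answer token serves as the score.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution over tokens."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def preservation_score(first_token_logits, yes="Yes", no="No"):
    """Normalized P(Yes) over the {Yes, No} answer tokens.

    `first_token_logits` stands in for an evaluator LM's logits at the
    position of the first answer token after a judgment prompt such as
    "Is the content of the rewrite preserved? Answer Yes or No."
    (prompt wording is hypothetical).
    """
    probs = softmax(first_token_logits)
    p_yes, p_no = probs.get(yes, 0.0), probs.get(no, 0.0)
    return p_yes / (p_yes + p_no)

# Dummy logits standing in for a real model's output at the answer position.
score = preservation_score({"Yes": 2.0, "No": 0.5, "Maybe": -1.0})  # ~0.82
```

Because only a single forward pass over the prompt is needed (no sampled generation), such a score is cheap to compute, which is consistent with the abstract's claim of an efficient zero-shot method.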