🤖 AI Summary
Existing content preservation evaluation methods for text style/attribute transfer, which rely on lexical or semantic similarity metrics or on current LLM-based evaluators, fail to model stylistic conditionality, resulting in low correlation with human judgments. Method: The authors propose, for the first time, that content preservation assessment must be *conditioned on the style transfer*, and introduce a zero-shot evaluation method based on next-token conditional likelihood. They also construct a human-annotated benchmark designed specifically for meta-evaluation alignment and use it to systematically validate the necessity of conditional modeling across multiple style transfer tasks. Contribution/Results: Experiments show that the proposed method significantly outperforms baseline approaches, improving correlation with human judgments by 23% on average. This work establishes conditional modeling as essential for accurate, human-aligned content preservation evaluation in style transfer.
📝 Abstract
LLMs make it easy to rewrite text in any style, be it more polite, more persuasive, or more positive. We present a large-scale study of evaluation metrics for style and attribute transfer with a focus on content preservation, i.e., that content not attributable to the style shift is preserved. The de facto evaluation approach uses lexical or semantic similarity metrics, often computed between source sentences and rewrites. While these metrics are not designed to distinguish style differences from content differences, empirical meta-evaluation shows a reasonable correlation with human judgment. In fact, recent works find that LLMs prompted as evaluators perform only comparably to semantic similarity metrics, even though, intuitively, the LLM approach should better fit the task. To investigate this discrepancy, we benchmark 8 metrics for evaluating content preservation on existing datasets and additionally construct a new test set that better aligns with the meta-evaluation aim. On this test set, we find that the empirical conclusion matches the intuition: content preservation metrics for style/attribute transfer must be conditioned on the style shift. To support this, we propose a new, efficient zero-shot evaluation method using the likelihood of the next token. We hope our meta-evaluation fosters more research on evaluating content preservation metrics and helps ensure fair evaluation of methods for conducting style transfer.
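The next-token likelihood idea can be sketched as follows. This is a hypothetical illustration, not the paper's exact method: the prompt wording, evaluator model, and answer tokens here are assumptions. The evaluator LM would be shown the source, the rewrite, and the target style, then asked a yes/no question about content preservation; the probability mass it assigns to "Yes" as the first answer token serves as the score.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution over tokens."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def preservation_score(first_token_logits, yes="Yes", no="No"):
    """Normalized P(Yes) over the {Yes, No} answer tokens.

    `first_token_logits` stands in for an evaluator LM's logits at the
    position of the first answer token after a judgment prompt such as
    "Is the content of the rewrite preserved? Answer Yes or No."
    (prompt wording is hypothetical).
    """
    probs = softmax(first_token_logits)
    p_yes, p_no = probs.get(yes, 0.0), probs.get(no, 0.0)
    return p_yes / (p_yes + p_no)

# Dummy logits standing in for a real model's output at the answer position.
score = preservation_score({"Yes": 2.0, "No": 0.5, "Maybe": -1.0})  # ~0.82
```

Because only a single forward pass over the prompt is needed (no sampled generation), such a score is cheap to compute, which is consistent with the abstract's claim of an efficient zero-shot method.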