🤖 AI Summary
Existing text personalization quality evaluation lacks dedicated metrics, relying predominantly on multi-LLM meta-evaluation—yet this approach suffers from model bias and prohibitive computational overhead.
Method: We propose PerQ, a lightweight, multilingual-compatible automatic metric that quantifies personalization quality without human annotations. PerQ jointly models generation discrepancies across multiple large and small language models, integrating their meta-evaluative capabilities while incorporating a bias-correction mechanism.
Contribution/Results: Experiments demonstrate that PerQ achieves strong agreement with human judgments across multilingual benchmarks (average Spearman’s ρ = 0.82), while reducing computational cost by 76% compared to conventional multi-LLM ensemble methods. This substantially improves evaluation efficiency and mitigates resource waste, enabling scalable, reliable, and equitable personalization assessment.
📝 Abstract
Since no metrics are available to evaluate specific aspects of a text, such as its personalization quality, the researchers often rely solely on large language models to meta-evaluate such texts. Due to internal biases of individual language models, it is recommended to use multiple of them for combined evaluation, which directly increases costs of such meta-evaluation. In this paper, a computationally efficient method for evaluation of personalization quality of a given text (generated by a language model) is introduced, called PerQ. A case study of comparison of generation capabilities of large and small language models shows the usability of the proposed metric in research, effectively reducing the waste of resources.