*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

📅 2026-02-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost and post-processing requirements of existing LLM-as-a-judge approaches for evaluating automatically generated text. Building upon ParaPLUIE, a perplexity-based metric that estimates an LLM's confidence over "Yes/No" answers without generating any text, the authors propose task-specific prompting variants, termed *-PLUIE. According to the reported experiments, personalising the prompt to the task yields stronger correlations with human ratings while keeping computational cost low, making *-PLUIE an efficient alternative to generation-based LLM judges.

📝 Abstract

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
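
To make the idea of scoring "Yes/No" confidence without generating text more concrete, here is a minimal sketch. It assumes the score is the difference between the log-probabilities a model assigns to "Yes" and to "No" as the next token after the judge prompt; the abstract does not state the exact formula, so this is an illustration rather than the authors' definition. The function names, the toy logits, and the token indices are all hypothetical.

```python
import math


def log_softmax(logits):
    """Convert raw next-token logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def yes_no_score(logits, yes_id, no_id):
    """Assumed score: log P("Yes") - log P("No") for the next token.

    Positive values mean the model leans towards "Yes", negative towards "No".
    Only one forward pass is needed, and no text is decoded or parsed.
    """
    logp = log_softmax(logits)
    return logp[yes_id] - logp[no_id]


def yes_probability(score):
    """Map the score to P("Yes") renormalised over the two answers only."""
    return 1.0 / (1.0 + math.exp(-score))


# Toy next-token logits over a three-token vocabulary (hypothetical values):
# index 0 stands for "Yes", index 1 for "No", index 2 for any other token.
toy_logits = [2.0, 0.5, -1.0]
score = yes_no_score(toy_logits, yes_id=0, no_id=1)
print(score, yes_probability(score))
```

In practice the logits would come from a single forward pass of the judge LLM over a task-specific prompt, which is why this kind of metric needs neither decoding nor post-processing of free-form output.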

Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-judge

text evaluation

computational cost

human alignment

automatic text generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-judge

perplexity-based evaluation

personalized prompting