*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the computational cost and post-processing requirements of existing LLM-as-a-judge approaches to evaluating automatically generated text. Building on ParaPLUIE, a perplexity-based metric that estimates an LLM's confidence over "Yes/No" answers without generating text, the authors propose task-specific prompt variants, termed *-PLUIE. According to the reported experiments, personalised *-PLUIE correlates more strongly with human ratings while keeping computational cost low.

📝 Abstract
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
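To make the idea of scoring "Yes/No" answers without generating text concrete, here is a minimal sketch. It assumes the judge score is the difference between the log-probabilities of "Yes" and "No" as the next token after a judge prompt. This page does not state the exact ParaPLUIE formula, so the log-ratio is an assumption, and the function name `yes_no_log_ratio` and the toy logits are hypothetical stand-ins for the output of a single LLM forward pass.

```python
import math


def yes_no_log_ratio(next_token_logits, yes_token="Yes", no_token="No"):
    """Return log P(yes) - log P(no) from one set of next-token logits.

    Assumption: this log-ratio stands in for a perplexity-based confidence
    score; the exact formula used by ParaPLUIE is not given on this page.
    """
    # Log-sum-exp normaliser over the (toy) vocabulary, computed stably.
    peak = max(next_token_logits.values())
    log_z = peak + math.log(
        sum(math.exp(value - peak) for value in next_token_logits.values())
    )
    log_p_yes = next_token_logits[yes_token] - log_z
    log_p_no = next_token_logits[no_token] - log_z
    return log_p_yes - log_p_no


# Hypothetical logits for the token that would follow a judge prompt.
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": 1.0}
score = yes_no_log_ratio(toy_logits)
print(score > 0)  # a positive score means the toy model leans towards "Yes"
```

Under this assumption, one forward pass yields a continuous score whose sign gives the decision and whose magnitude gives the confidence, so no generated answer has to be parsed afterwards.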
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-judge
text evaluation
computational cost
human alignment
automatic text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-judge
perplexity-based evaluation
personalized prompting
efficient text evaluation
human alignment
Authors
Quentin Lemesle
Univ Rennes, CNRS, IRISA, Expression
Léane Jourdan
PhD student in computer science, LS2N, Nantes Université
NLP, writing assistance, text revision
Daisy Munson
Univ Rennes, CNRS, IRISA, Sotern
Pierre Alain
Univ Rennes, CNRS, IRISA, Sotern
Jonathan Chevelu
Univ Rennes, CNRS, IRISA, Expression
Arnaud Delhay
Université de Rennes - IRISA
speech processing, computational complexity, analogical proportions, anomaly detection
Damien Lolive
UBS, CNRS, IRISA
NLP, text-to-speech synthesis, speech processing