🤖 AI Summary
This work addresses the high computational cost and post-processing requirements of existing LLM-as-a-judge approaches to evaluating automatically generated text. Building upon ParaPLUIE, the authors propose task-specific prompt variants, termed *-PLUIE, which use perplexity to estimate an LLM's confidence over “Yes/No” answers without generating any text. With prompts tailored to each task, *-PLUIE achieves stronger alignment with human judgments while keeping computational cost low. The reported experiments show stronger correlations with human ratings, making *-PLUIE an efficient alternative for evaluating generated text.
📝 Abstract
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
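As a rough illustration of scoring "Yes/No" confidence without generating text, the sketch below compares the log-probabilities a model assigns to the two answer tokens at the next-token position. The log-ratio form, the function names, and the toy logits are all assumptions made for illustration; the abstract does not spell out the exact formula, and a real setup would take the logits from a single forward pass of the judge LLM on a task-specific prompt.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over a vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def yes_no_log_ratio(next_token_logits: Sequence[float],
                     yes_id: int, no_id: int) -> float:
    """Hypothetical score: log P("Yes") - log P("No") at the answer position.

    Positive values mean the model leans towards "Yes", negative towards "No".
    No text is decoded, so there is nothing to parse afterwards.
    """
    log_probs = log_softmax(next_token_logits)
    return log_probs[yes_id] - log_probs[no_id]


# Toy vocabulary of four tokens; the ids for "Yes" and "No" are made up.
toy_logits = [0.1, 2.3, -0.7, 0.4]
YES_ID, NO_ID = 1, 2
score = yes_no_log_ratio(toy_logits, YES_ID, NO_ID)
print(f"score = {score:.3f}")
```

Because the score is a single real number read off one forward pass, it can be correlated with human ratings directly, which is consistent with the low cost and absence of post-processing described above.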