🤖 AI Summary
This work addresses the limited scope of existing large language model (LLM) prompt evaluation, which predominantly focuses on response correctness while neglecting interpretable, fine-grained joint analysis of prompt quality and its relationship with generated responses. To bridge this gap, the authors propose PEEM, a novel framework enabling the first interpretable joint evaluation of prompts and responses. PEEM employs a structured scoring system comprising three prompt-level and six response-level metrics, leveraging LLMs for zero-shot scoring accompanied by natural language rationales. The framework demonstrates robust diagnostic stability under various perturbations and exhibits high alignment with conventional accuracy metrics (Spearman ρ ≈ 0.97). Furthermore, zero-shot prompt rewrites guided by PEEM feedback improve downstream task accuracy by up to 11.7 points, significantly outperforming supervised and reinforcement learning baselines.
📝 Abstract
Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint, interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness). An LLM-based evaluator outputs (i) scalar scores on a 1–5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis aligns strongly with conventional accuracy while preserving model rankings (aggregate Spearman ρ ≈ 0.97, Pearson r ≈ 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise ρ = 0.68–0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate ≈ 76.7–80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt-rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.
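The evaluation-plus-rewriting protocol the abstract describes can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the identifiers (`PROMPT_AXES`, `RESPONSE_AXES`, `peem_evaluate`, `rewrite_loop`) and the stub judge are assumptions; a real system would replace `stub_judge` with a zero-shot LLM call that returns structured scores and rationales.

```python
# Sketch of a PEEM-style protocol, per the abstract: 9 rubric axes,
# 1-5 Likert scores with rationales, and a feedback-driven rewrite loop.
# All names here are illustrative assumptions, not the paper's code.

PROMPT_AXES = ["clarity_structure", "linguistic_quality", "fairness"]
RESPONSE_AXES = ["accuracy", "coherence", "relevance",
                 "objectivity", "clarity", "conciseness"]

def stub_judge(prompt, response):
    """Stand-in for the LLM evaluator: one 1-5 score plus a short
    rationale per axis. A real judge would be prompted zero-shot with
    the rubric text and its structured output parsed."""
    scores = {}
    for axis in PROMPT_AXES + RESPONSE_AXES:
        # Toy heuristic so the sketch runs; purely illustrative.
        score = 5 if len(prompt) < 200 else 3
        scores[axis] = {"score": score,
                        "rationale": f"{axis}: judged against the rubric."}
    return scores

def peem_evaluate(prompt, response, judge=stub_judge):
    """Joint prompt/response evaluation over all 9 axes."""
    report = judge(prompt, response)
    assert set(report) == set(PROMPT_AXES + RESPONSE_AXES)
    return report

def rewrite_loop(prompt, task_model, judge=stub_judge, rounds=3):
    """Zero-shot rewriting loop: only PEEM scores and rationales are
    fed back as the signal for improving the prompt each round."""
    for _ in range(rounds):
        response = task_model(prompt)
        report = peem_evaluate(prompt, response, judge)
        weak = [a for a, v in report.items() if v["score"] < 4]
        if not weak:
            break
        # A real loop would ask an LLM to rewrite the prompt using the
        # rationales for the weak axes; here we only annotate it.
        prompt = prompt + f"  [revise for: {', '.join(weak)}]"
    return prompt
```

The key design point reflected here is that the rewrite loop consumes nothing but the criterion-level scores and rationales, which is what lets the paper's optimization remain zero-shot rather than requiring supervised or RL training.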