🤖 AI Summary
Evaluating long-form responses in high-stakes domains (e.g., law, medicine) lacks methods that jointly ensure semantic correctness and interpretability. To address this, we propose DeCE, a decoupled LLM evaluation framework that, for the first time, introduces precision/recall principles into automated LLM assessment, decomposing answer quality into two orthogonal dimensions: factual accuracy (precision) and coverage completeness (recall). DeCE leverages LLMs to automatically extract instantiated criteria from gold answers, enabling fine-grained decomposed scoring without predefined taxonomies or classification schemes. The framework is model- and domain-agnostic. Evaluated on real-world legal question answering, DeCE achieves a correlation of 0.78 with expert judgments, significantly outperforming BLEU, ROUGE, and state-of-the-art LLM-based evaluators. Moreover, it uncovers a systematic precision-coverage trade-off between general-purpose and domain-specialized models.
📝 Abstract
Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality to a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$) than traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework for expert domains.
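The core scoring idea described above can be sketched as a precision/recall computation over per-instance judgments. A minimal sketch follows; the function name `dece_scores` and its inputs are hypothetical, and the hard part DeCE automates (using an LLM to extract criteria from the gold answer and to judge each claim/criterion) is assumed to have already produced the boolean judgments passed in.

```python
def dece_scores(claim_supported: list[bool], criteria_covered: list[bool]):
    """Hypothetical DeCE-style decomposed scoring.

    claim_supported:  one bool per claim extracted from the model answer,
                      True if the claim is factually accurate and relevant.
    criteria_covered: one bool per criterion extracted from the gold answer,
                      True if the model answer addresses that criterion.
    """
    # Precision: fraction of the answer's claims that are accurate/relevant.
    precision = sum(claim_supported) / len(claim_supported) if claim_supported else 0.0
    # Recall: fraction of required gold-answer criteria the answer covers.
    recall = sum(criteria_covered) / len(criteria_covered) if criteria_covered else 0.0
    # Harmonic mean, for a single summary number when one is needed.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 2 of 3 claims supported, 1 of 4 gold criteria covered --
# a "precise but low-coverage" answer, the trade-off profile the paper
# attributes to domain-specialized models.
p, r, f1 = dece_scores([True, True, False], [True, False, False, False])
```

Keeping precision and recall separate (rather than reporting only the combined score) is what makes the trade-off between generalist and specialized models visible.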