🤖 AI Summary
Legal proposition generation is crucial for legal reasoning, yet it lacks a systematic evaluation methodology. This work proposes LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes proposition quality into formal validity and substantive dimensions, together with a dataset of expert annotations for 100 LLM-generated propositions derived from Court of Justice of the European Union rulings. By integrating legal expert knowledge into the evaluation criteria and comparing rubric-guided large language model (LLM) assessments against human expert judgments, the study finds that LLMs can generate well-formed, high-quality legal propositions, particularly from well-established cases, and that rubric-guided LLM evaluation aligns more closely with expert judgment than direct overall scoring. Nevertheless, LLM evaluators remain insensitive to the finer-grained distinctions that human experts capture.
📝 Abstract
Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.