🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models' (LLMs) ability to generate legally grounded, compliance-aware responses supported by authoritative citations in the legal domain. To this end, we introduce the first legal citation–enhanced benchmark. Methodologically, we propose a syllogism-driven tripartite alignment framework—integrating citations, responses, and questions—combined with retrieval-augmented generation (RAG), legal semantic alignment modeling, and multi-granularity citation provenance attribution. We release a manually annotated legal QA dataset and an authoritative reference corpus comprising judicial precedents and statutory provisions, covering both public and professional user perspectives. Experiments across two general-purpose and seven domain-specific LLMs demonstrate that citation integration substantially improves response legality and credibility. Moreover, our automated evaluation metrics achieve high agreement with human judgments (Krippendorff's α = 0.89).
📝 Abstract
In this paper, we propose CitaLaw, the first benchmark designed to evaluate LLMs' ability to produce legally sound responses with appropriate citations. CitaLaw features a diverse set of legal questions for both laypersons and practitioners, paired with a comprehensive corpus of law articles and precedent cases as a reference pool. This framework enables LLM-based systems to retrieve supporting citations from the reference corpus and align these citations with the corresponding sentences in their responses. Moreover, we introduce syllogism-inspired evaluation methods to assess the legal alignment between retrieved references and LLM-generated responses, as well as their consistency with user questions. Extensive experiments on two open-domain and seven legal-specific LLMs demonstrate that integrating legal references substantially enhances response quality. Furthermore, our proposed syllogism-based evaluation method exhibits strong agreement with human judgments.