🤖 AI Summary
Existing large language models (LLMs) underperform in real-world legal consultation dialogues, and progress has been hindered by the absence of high-quality, multi-turn evaluation benchmarks. Method: We introduce LeCoDe, the first authentic multi-turn legal consultation dialogue dataset, comprising 3,696 dialogues and 110,008 utterances sourced from live-streaming sessions on short-video platforms and rigorously annotated by legal experts at multiple tiers. We propose a unified dual-dimension evaluation framework assessing *clarification capability* and *professional advice quality*, operationalized via 12 fine-grained metrics; we also pioneer a live-stream-oriented paradigm for collecting legal dialogues. Contribution/Results: Benchmarking reveals substantial gaps in state-of-the-art models, with even GPT-4 achieving only 39.8% clarification recall and a 59% overall advice-quality score. LeCoDe establishes a reproducible evaluation baseline and identifies concrete avenues for advancing legal dialogue systems.
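For intuition, here is a minimal sketch of how a clarification-recall metric of this kind could be computed. The matching function (`token_overlap`), the 0.5 threshold, and the example questions are illustrative assumptions, not LeCoDe's actual protocol; the paper's 12 metrics and its expert-matching procedure are not reproduced here.

```python
from typing import List

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between word sets; a crude stand-in for whatever
    semantic matching the benchmark actually uses (assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def clarification_recall(gold: List[str], asked: List[str],
                         threshold: float = 0.5) -> float:
    """Fraction of expert-annotated clarification points covered by the
    model's questions; a gold point counts as covered if any asked
    question exceeds the similarity threshold."""
    if not gold:
        return 1.0  # nothing to clarify
    covered = sum(
        any(token_overlap(g, q) >= threshold for q in asked) for g in gold
    )
    return covered / len(gold)

# Toy example: two expert clarification points, the model covers one.
gold = ["When was the employment contract signed?",
        "Did the employer provide written notice of termination?"]
asked = ["Could you tell me when the employment contract was signed?"]
print(f"clarification recall: {clarification_recall(gold, asked):.2f}")  # 0.50
```

Under this toy definition, a model asking no clarifying questions on a case with annotated clarification points scores 0, which mirrors the recall-style reading of the 39.8% figure reported for GPT-4.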
📝 Abstract
Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet it remains costly and inaccessible to many due to the shortage of legal professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. To construct LeCoDe, we collect live-streamed consultations from short-video platforms, yielding authentic multi-turn legal consultation dialogues, and enrich the dataset with professional insights through rigorous annotation by legal experts. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capability along two dimensions, (1) clarification capability and (2) professional advice quality, operationalized as 12 metrics in a unified framework. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task: even state-of-the-art models like GPT-4 achieve only 39.8% recall for clarification and a 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark advances research on legal-domain dialogue systems, particularly in simulating realistic user-expert interactions.