LawGPT: Knowledge-Guided Data Generation and Its Application to Legal LLM

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation of open-source large language models (LLMs) on legal reasoning tasks caused by domain-specific data scarcity. To this end, the authors propose KgDG, a knowledge-guided data generation framework that leverages a legal knowledge base to steer synthetic data creation and employs a two-stage verification mechanism, combining rule-based checks with LLM-assisted validation, to ensure fidelity and reasoning quality. Based on KgDG, they construct a 50K-example synthetic legal reasoning dataset, expand it to further strengthen reasoning, and use it to instruction-tune LawGPT. Experiments show that LawGPT significantly outperforms existing open-source legal LLMs across multiple legal reasoning benchmarks and matches the performance of leading proprietary models, marking a systematic application of knowledge-injected synthetic data generation to training legal LLMs.
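The two-stage verification the summary describes can be sketched as a simple filter pipeline: a cheap rule-based pass rejects malformed samples before any expensive LLM-judge call. This is a minimal illustrative sketch, not the paper's implementation; the `Sample` fields, the specific rules, and the `llm_validate` stub are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    question: str   # generated legal question
    reasoning: str  # generated chain of reasoning
    answer: str     # final answer

def rule_checks(s: Sample) -> bool:
    """Stage 1: cheap rule-based checks, applied before any LLM call."""
    if not (s.question and s.reasoning and s.answer):
        return False               # reject empty fields
    if len(s.reasoning) < 20:
        return False               # reasoning too short to be useful
    if s.answer not in s.reasoning:
        return False               # answer must be grounded in the chain
    return True

def llm_validate(s: Sample) -> bool:
    """Stage 2 placeholder: a real system would prompt an LLM judge to
    score legal soundness; here every surviving sample is accepted."""
    return True

def verify(batch: list[Sample]) -> list[Sample]:
    """Keep only samples passing both verification stages."""
    return [s for s in batch if rule_checks(s) and llm_validate(s)]

samples = [
    Sample("Is an oral contract binding?",
           "Oral contracts are generally binding unless a statute of "
           "frauds applies; no writing requirement is triggered, so yes.",
           "yes"),
    Sample("", "", ""),  # malformed: dropped by stage 1
]
kept = verify(samples)
print(len(kept))  # prints 1
```

Ordering the stages this way keeps cost low: rule checks are free, so only samples that survive them would consume LLM-judge queries in a real pipeline.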

📝 Abstract
Large language models (LLMs), both proprietary and open-source, have demonstrated remarkable capabilities across various natural language processing tasks. However, they face significant limitations in legal reasoning tasks. Proprietary models introduce data privacy risks and high inference costs, while open-source models underperform due to insufficient legal domain training data. To address these limitations, we study data generation for legal reasoning to improve the legal reasoning performance of open-source LLMs with the help of proprietary LLMs. This is challenging due to the lack of legal knowledge in proprietary LLMs and the difficulty in verifying the generated data. We propose KgDG, a knowledge-guided data generation framework for legal reasoning. Our framework enables leveraging legal knowledge to enhance generation diversity and introduces a refinement and verification process to ensure the quality of generated data. Moreover, we expand the generated dataset to further enhance the LLM reasoning capabilities. Using KgDG, we create a synthetic legal reasoning dataset containing 50K high-quality examples. Our trained model LawGPT outperforms existing legal-specific LLMs and achieves performance comparable to proprietary LLMs, demonstrating the effectiveness of KgDG and LawGPT. Our code and resources are publicly available at https://anonymous.4open.science/r/KgDG-45F5 .
Problem

Research questions and friction points this paper is trying to address.

Improving legal reasoning in LLMs
Generating legal domain training data
Ensuring data quality and diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-guided data generation
Refinement and verification process
Synthetic legal reasoning dataset