Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering

📅 2025-01-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe hallucination and frequent factual errors when large language models (LLMs) answer legal questions, this paper introduces LegalHalBench, the first domain-specific benchmark for evaluating legal hallucinations, and proposes a mitigation method that combines behavior cloning with Hard Sample-aware Iterative Direct Preference Optimization (HIPO) to strengthen alignment with legal texts and enable automatic detection of hallucination patterns. Compared with strong baselines, the method significantly improves the Non-Hallucinated Statute Rate, Statute Relevance Rate, and Legal Claim Truthfulness, and achieves state-of-the-art results on METEOR, BERTScore, ROUGE-L, and human-preference win rates. The core contributions are: (1) LegalHalBench, the first dedicated benchmark for evaluating legal hallucinations; and (2) a fine-tuning paradigm, built on behavior cloning and HIPO, explicitly designed for factual consistency in legal QA.

📝 Abstract
Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models (LLMs), particularly in high-stakes domains such as legal question answering (QA). To mitigate the hallucination rate in legal QA, we first introduce a benchmark called LegalHalBench and three automatic metrics to evaluate the common hallucinations when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). We conduct extensive real-data experiments to validate the effectiveness of our approach. Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate, Statute Relevance Rate, and Legal Claim Truthfulness, as well as traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
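The page does not reproduce the paper's formulas, but HIPO builds on the standard Direct Preference Optimization (DPO) objective, which the iterative, hard-sample-aware training loop repeatedly applies to preference pairs of non-hallucinated vs. hallucinated answers. Below is a minimal sketch of that standard DPO loss for a single pair; the function name, the toy log-probability values, and the `beta` default are illustrative assumptions, and HIPO's hard-sample selection and iteration are deliberately not shown.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (sketch, not the paper's code).

    Each argument is the summed token log-probability of a full response
    under the trained policy (logp_*) or the frozen reference model
    (ref_logp_*). `beta` scales the implicit reward.
    """
    # Implicit reward margin between the preferred (e.g. non-hallucinated)
    # and rejected (e.g. hallucinated) answer.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: small when the policy already
    # prefers the chosen answer by a wide margin relative to the reference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probabilities: the policy slightly prefers the
# grounded answer over the hallucinated one.
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
```

In an iterative scheme like the one described, pairs on which this loss stays large after a round of training are natural candidates for the "hard samples" emphasized in the next round.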
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Legal Question Accuracy
Information Fabrication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Imitation Learning
HIPO Method
Legal Information Accuracy
Yinghao Hu — School of Software Technology, Zhejiang University
Leilei Gan — Zhejiang University (NLP, LLMs, Multimodal LLMs, AI+X)
Wenyi Xiao — Zhejiang University
Kun Kuang — Zhejiang University (Causal Inference, Data Mining, Machine Learning)
Fei Wu — College of Computer Science and Technology, Zhejiang University