Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering

📅 2025-01-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe hallucination and frequent factual errors when large language models (LLMs) answer legal questions, this paper introduces LegalHalBench, the first domain-specific benchmark for evaluating legal hallucinations, and proposes a mitigation method that combines behavior cloning with Hard Sample-aware Iterative Direct Preference Optimization (HIPO) to strengthen alignment with legal texts and enable automatic detection of hallucination patterns. Compared with strong baselines, the method significantly improves the Non-Hallucinated Statute Rate, Statute Relevance Rate, and Legal Claim Truthfulness, and achieves state-of-the-art results on METEOR, BERTScore, ROUGE-L, and human-preference win rates. The core contributions are: (1) LegalHalBench, the first dedicated benchmark for evaluating legal hallucinations; and (2) a fine-tuning paradigm, built on behavior cloning and HIPO, explicitly designed for factual consistency in legal QA.

📝 Abstract
Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models (LLMs), particularly in high-stakes domains such as legal question answering (QA). To mitigate the hallucination rate in legal QA, we first introduce a benchmark called LegalHalBench and three automatic metrics to evaluate the common hallucinations when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). We conduct extensive real-data experiments to validate the effectiveness of our approach. Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate, Statute Relevance Rate, and Legal Claim Truthfulness, as well as traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
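The page does not reproduce the paper's formulas, but HIPO builds on the standard Direct Preference Optimization (DPO) objective, which the iterative, hard-sample-aware training loop repeatedly applies to preference pairs of non-hallucinated vs. hallucinated answers. Below is a minimal sketch of that standard DPO loss for a single pair; the function name, the toy log-probability values, and the `beta` default are illustrative assumptions, and HIPO's hard-sample selection and iteration are deliberately not shown.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (sketch, not the paper's code).

    Each argument is the summed token log-probability of a full response
    under the trained policy (logp_*) or the frozen reference model
    (ref_logp_*). `beta` scales the implicit reward.
    """
    # Implicit reward margin between the preferred (e.g. non-hallucinated)
    # and rejected (e.g. hallucinated) answer.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: small when the policy already
    # prefers the chosen answer by a wide margin relative to the reference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probabilities: the policy slightly prefers the
# grounded answer over the hallucinated one.
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
```

In an iterative scheme like the one described, pairs on which this loss stays large after a round of training are natural candidates for the "hard samples" emphasized in the next round.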
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Legal Question Accuracy
Information Fabrication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Imitation Learning
HIPO Method
Legal Information Accuracy
Yinghao Hu — School of Software Technology, Zhejiang University
Leilei Gan — Zhejiang University (NLP, LLMs, Multimodal LLMs, AI+X)
Wenyi Xiao — Zhejiang University
Kun Kuang — Zhejiang University (Causal Inference, Data Mining, Machine Learning)
Fei Wu — College of Computer Science and Technology, Zhejiang University