Domain-Adaptation through Synthetic Data: Fine-Tuning Large Language Models for German Law

📅 2026-01-20
🤖 AI Summary
This work addresses hallucination in large language models applied to high-stakes, knowledge-intensive domains such as German law, where the models' limited expert knowledge undermines the reliability of their answers. To mitigate this, the authors propose a systematic pipeline that automatically generates high-quality, diverse question-answer pairs from authoritative German statutes, applies automated quality filtering, and then adapts the model with parameter-efficient fine-tuning (e.g., using LoRA), so that domain adaptation requires no manual annotation. Experiments show that models fine-tuned on this synthetic data significantly outperform their un-adapted baseline counterparts on German legal question-answering tasks, which supports the method’s effectiveness at improving factual accuracy in specialized, knowledge-driven settings.
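
To make the parameter-efficient step concrete, here is a minimal sketch of the general LoRA idea: the pretrained weight matrix W stays frozen, and only a low-rank update B·A, scaled by alpha / r, is trained. It is written in plain Python for readability and is not the authors' training code; the function names `lora_forward` and `trainable_params`, and the dimensions used in the note below, are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x).

    W: frozen pretrained weights, shape d_out x d_in
    A: trainable down-projection, shape r x d_in
    B: trainable up-projection, shape d_out x r
    """
    r = len(A)                            # rank of the update
    base = matvec(W, x)                   # frozen path
    update = matvec(B, matvec(A, x))      # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Number of trainable values in A and B for one adapted layer."""
    return r * (d_in + d_out)
```

For a hypothetical 4096 x 4096 projection with rank r = 8, `trainable_params(4096, 4096, 8)` gives 65,536 trainable values, compared with 16,777,216 in the full matrix. When B is initialized to zeros, as is usual for LoRA, the adapted layer starts out identical to the pretrained one.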

📝 Abstract
Large language models (LLMs) often struggle in specialized domains such as legal reasoning due to limited expert knowledge, resulting in factually incorrect outputs or hallucinations. This paper presents an effective method for adapting advanced LLMs to German legal question answering through a novel synthetic data generation approach. In contrast to costly human-annotated resources or unreliable synthetic alternatives, our approach systematically produces high-quality, diverse, and legally accurate question-answer pairs directly from authoritative German statutes. Using rigorous automated filtering methods and parameter-efficient fine-tuning techniques, we demonstrate that LLMs adapted with our synthetic dataset significantly outperform their baseline counterparts on German legal question answering tasks. Our results highlight the feasibility of using carefully designed synthetic data as a robust alternative to manual annotation in high-stakes, knowledge-intensive domains.
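
The abstract does not spell out what the automated filtering checks, so the following is only a sketch of what such a stage might look like, under stated assumptions. It drops generated pairs that duplicate an earlier question, are not phrased as questions, have very short answers, or whose answer shares too few words with the source statute passage. The `QAPair` type, the token-overlap grounding score, and both thresholds are hypothetical choices, not the paper's criteria.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    source: str  # statute passage the pair was generated from


def tokens(text):
    """Lowercased set of word tokens (Unicode-aware, so umlauts are kept)."""
    return set(re.findall(r"\w+", text.lower()))


def grounding(pair):
    """Fraction of answer tokens that also occur in the source passage."""
    answer_tokens = tokens(pair.answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(pair.source)) / len(answer_tokens)


def filter_pairs(pairs, min_grounding=0.6, min_answer_tokens=3):
    """Keep pairs that pass all heuristic checks (thresholds are assumptions)."""
    seen_questions = set()
    kept = []
    for pair in pairs:
        key = " ".join(sorted(tokens(pair.question)))
        if key in seen_questions:
            continue  # duplicate of an earlier question
        if not pair.question.strip().endswith("?"):
            continue  # not phrased as a question
        if len(tokens(pair.answer)) < min_answer_tokens:
            continue  # answer too short to be informative
        if grounding(pair) < min_grounding:
            continue  # answer not supported by the source text
        seen_questions.add(key)
        kept.append(pair)
    return kept
```

Lexical overlap is a crude proxy for legal accuracy; a production filter would more plausibly combine it with a model-based judge or an entailment check, which this sketch does not attempt.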

Problem

Research questions and friction points this paper is trying to address.

domain adaptation
legal reasoning
large language models
synthetic data
German law

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation
domain adaptation
legal question answering
parameter-efficient fine-tuning
large language models