AI Summary
Current large language models exhibit significant limitations in legal reasoning and cross-jurisdictional understanding. Method: We introduce LawInstruct, the first large-scale, multi-jurisdictional (17 jurisdictions), multilingual (24 languages) legal instruction dataset, comprising 12 million samples across tasks including legal question answering, case summarization, and argument mining. Our approach systematically unifies 58 heterogeneous legal datasets into a standardized instruction format, proposes a cross-jurisdictional and cross-lingual legal instruction tuning paradigm, and conducts fine-tuning and evaluation on Flan-T5. Results: The resulting model, FLawN-T5, improves performance on LegalBench across all model sizes, with a 15-point (50%) gain at the base size. Crucially, no model size shows degradation on MMLU, confirming that domain-specific legal instruction tuning preserves general-purpose capabilities. The LawInstruct dataset is publicly released to foster reproducible research in legal AI.
Abstract
Instruction tuning is an important step in making language models useful for direct user interaction. However, the legal domain is underrepresented in typical instruction datasets (e.g., only 10 out of 1,600+ tasks in Super-NaturalInstructions). To study whether instruction tuning on legal datasets is necessary for strong legal reasoning, we aggregate 58 annotated legal datasets and write instructions for each, creating LawInstruct. LawInstruct covers 17 global jurisdictions, 24 languages, and a total of 12M examples across diverse tasks such as legal QA, summarization of court cases, and legal argument mining. We evaluate our models on LegalBench, measuring legal reasoning across five categories in 162 challenging and realistic legal tasks, and on MMLU, to measure potential drops in general reasoning capabilities. We find that legal-specific instruction tuning on Flan-T5 (yielding FLawN-T5) improves performance on LegalBench across all model sizes, with an aggregate increase of 15 points, or 50%, over Flan-T5 at the base size. No model size shows performance drops on MMLU. We publish LawInstruct as a resource for further study of instruction tuning in the legal domain.
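The core data-construction step described above, unifying heterogeneous annotated datasets under a shared instruction format, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline or schema: the field names (`instruction`, `input`, `output`), the helper function, and the argument-mining example are all assumptions for the sake of the sketch.

```python
# Minimal sketch of converting an annotated (text, label) record from
# a legal dataset into a standardized instruction-tuning example.
# Field names and the instruction template are illustrative assumptions,
# not LawInstruct's exact schema.

def to_instruction_example(record: dict, instruction: str) -> dict:
    """Wrap an annotated record in a generic instruction format."""
    return {
        "instruction": instruction,
        "input": record["text"],
        "output": record["label"],
    }

# Hypothetical record from an argument-mining dataset.
record = {
    "text": "The appellant argues the contract was void for lack of consideration.",
    "label": "claim",
}

example = to_instruction_example(
    record,
    instruction=(
        "Classify the role of the following sentence in a legal argument "
        "(claim, premise, or conclusion)."
    ),
)
print(example["output"])  # prints "claim"
```

Writing one such adapter per source dataset (with a task-appropriate instruction) is what lets 58 differently-formatted corpora be pooled into a single instruction-tuning mixture.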