Evaluating LLM-based Approaches to Legal Citation Prediction: Domain-specific Pre-training, Fine-tuning, or RAG? A Benchmark and an Australian Law Case Study

📅 2024-12-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the fine-grained contextual understanding that legal citation prediction demands by introducing the AusLaw Citation Benchmark, to the authors' knowledge the first Australian legal citation benchmark of its scale and scope (55k instances, 18,677 unique citations). Methodologically, it systematically evaluates prompt engineering, retrieval-augmented generation (RAG), query expansion, re-ranking, and supervised fine-tuning, and proposes domain-adapted embeddings, hybrid re-ranking, and voting-based ensemble strategies. Results show that instruction-tuned open-source LLMs substantially outperform zero-shot LLMs and generic RAG baselines; retrieval quality depends critically on fine-grained legal corpora and specialised re-rankers; and although the best hybrid re-ranking approach achieves state-of-the-art performance, a gap of nearly 50% remains, highlighting the task's inherent difficulty and underscoring the benchmark's value for advancing legal AI research.

📝 Abstract
Large Language Models (LLMs) have demonstrated strong potential across legal tasks, yet the problem of legal citation prediction remains under-explored. At its core, this task demands fine-grained contextual understanding and precise identification of relevant legislation or precedent. We introduce the AusLaw Citation Benchmark, a real-world dataset comprising 55k Australian legal instances and 18,677 unique citations, which to the best of our knowledge is the first of its scale and scope. We then conduct systematic benchmarking across a range of solutions: (i) standard prompting of both general and law-specialised LLMs, (ii) retrieval-only pipelines with both generic and domain-specific embeddings, (iii) supervised fine-tuning, and (iv) several hybrid strategies that combine LLMs with retrieval augmentation through query expansion, voting ensembles, or re-ranking. Results show that neither general nor law-specific LLMs suffice as stand-alone solutions, with performance near zero. Instruction tuning (even of a generic open-source LLM) on a task-specific dataset is among the best-performing solutions. We highlight that database granularity, along with the type of embeddings, plays a critical role in retrieval-based approaches, with hybrid methods that utilise a trained re-ranker delivering the best results. Despite this, a performance gap of nearly 50% remains, underscoring the value of this challenging benchmark as a rigorous test-bed for future research in the legal domain.
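
The hybrid strategy the abstract describes (query expansion → retrieval → re-ranking → voting) can be sketched minimally as below. The citation snippets and the lexical-overlap scorer are hypothetical stand-ins for the paper's actual corpus, embedding retriever, and trained re-ranker; this only illustrates the pipeline shape, not the authors' implementation.

```python
from collections import Counter

# Toy citation database: id -> snippet text. In the paper this would be the
# Australian legislation/case-law corpus; these entries are hypothetical.
CITATIONS = {
    "Act A s 5":  "duty of care negligence damages",
    "Case B":     "contract breach remedy damages",
    "Act C s 12": "evidence admissibility hearsay",
}

def score(query: str, doc: str) -> int:
    """Naive token-overlap score (stand-in for an embedding similarity)."""
    q, d = Counter(query.split()), Counter(doc.split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Stage 1: retrieve the top-k candidate citations for one query."""
    ranked = sorted(CITATIONS, key=lambda c: score(query, CITATIONS[c]), reverse=True)
    return ranked[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Stage 2: re-rank candidates (same scorer here; the paper trains a dedicated re-ranker)."""
    return sorted(candidates, key=lambda c: score(query, CITATIONS[c]), reverse=True)

def vote(queries: list[str]) -> str:
    """Stage 3: voting ensemble over expanded query variants."""
    votes: Counter[str] = Counter()
    for q in queries:
        for rank, cite in enumerate(rerank(q, retrieve(q))):
            votes[cite] += len(CITATIONS) - rank  # earlier rank -> more votes
    return votes.most_common(1)[0][0]

# Query expansion would generate these variants from one legal passage.
expanded = ["negligence duty of care", "damages for negligence breach of duty"]
print(vote(expanded))  # -> Act A s 5
```

Swapping `score` for dense embeddings over a fine-grained corpus, and `rerank` for a trained cross-encoder, recovers the configurations the paper benchmarks.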
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-based methods for legal citation prediction
Assessing domain-specific pre-training, fine-tuning, and RAG approaches
Benchmarking performance on an Australian law citation dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-specific pre-training for legal tasks
Hybrid strategies combining LLMs with retrieval
Instruction tuning on task-specific datasets