Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the capabilities of large language models (LLMs) in multilingual, cross-jurisdictional, and adversarial legal reasoning—domains where standardized benchmarks are lacking. To address this gap, we introduce an open-source, modular multilingual legal benchmark framework integrating datasets including LEXam and XNLI, and propose an LLM-as-a-Judge methodology for human-aligned automated evaluation. Experiments employ LLaMA and Gemini series models, incorporating character- and word-level adversarial perturbations across 12 languages. Results reveal that average accuracy on legal tasks falls below 50%, substantially underperforming general-purpose benchmarks; Gemini outperforms LLaMA by ~24 percentage points; and model performance is highly sensitive to prompt engineering and syntactic similarity to English. Our core contributions are: (1) the first structured, multilingual legal reasoning evaluation framework, and (2) empirical evidence linking linguistic structure—particularly syntactic proximity to English—to model performance.

📝 Abstract
In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta's LLaMA, OpenAI's ChatGPT, Google's Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks, and assesses their adversarial robustness in legal tasks through character- and word-level perturbations. We use an LLM-as-a-Judge approach for human-aligned evaluation. We moreover present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with a particular focus on legal tasks, including classification, summarization, open questions, and general reasoning. Our findings confirm that legal tasks pose significant challenges for LLMs, with accuracies often below 50% on legal reasoning benchmarks such as LEXam, compared to over 70% on general-purpose tasks like XNLI. In addition, while English generally yields more stable results, it does not always lead to higher accuracy. Prompt sensitivity and adversarial vulnerability are also shown to persist across languages. Finally, a correlation is found between the performance of a language and its syntactic similarity to English. We also observe that LLaMA is weaker than Gemini, with the latter showing an average advantage of about 24 percentage points across the same tasks. Despite improvements in newer LLMs, challenges remain in deploying them reliably for critical, multilingual legal applications.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM limitations in multilingual legal reasoning tasks
Assessing adversarial robustness through text perturbation techniques
Developing an open-source pipeline for legal task benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular pipeline for multilingual legal evaluation
LLM-as-a-Judge approach for human-aligned assessment
Adversarial testing through character and word perturbations
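The character- and word-level perturbations described above can be sketched as simple text transforms. The exact attack recipes used in the paper are not given here, so the two functions below are hypothetical examples of each perturbation family, not the authors' implementation:

```python
import random

def char_perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Character-level perturbation: randomly swap adjacent letters (typo-style noise)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Word-level perturbation: randomly drop words from the input."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= rate]
    return " ".join(kept) if kept else text
```

Seeding the generator keeps perturbations reproducible, so accuracy drops can be compared across models and languages on identical corrupted inputs.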
Antreas Ioannou
Delft University of Technology
Andreas Shiamishis
Delft University of Technology
Nora Hollenstein
University of Zurich
Natural Language Processing · Cognitive Science · Machine Learning
Nezihe Merve Gürel
Delft University of Technology