Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of temporal validity in legal question answering: large language models rely on static knowledge with fixed training cutoffs, which often leads to outdated citations or recency bias. The authors systematically identify and define two distinct modes of temporal failure in legal QA, introduce the first multi-category German legal temporal QA benchmark comprising 312 expert-validated samples, and propose retrieval-augmented generation (RAG) approaches that treat temporal validity as a hard constraint by combining fact-date extraction with version-aware filtering of statutes. Experimental results show a severe performance drop for vanilla models on post-cutoff questions, whereas the proposed RAG methods substantially improve accuracy across all question types and outperform an unstable web-search baseline.

📝 Abstract

Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs from OpenAI, Anthropic, and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via fact-date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.
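To make the idea of temporal validity as a hard constraint concrete, the following is a minimal sketch of fact-date extraction followed by version filtering. It is not the authors' implementation: the `StatuteVersion` record, the function names, the regex-based date extraction (the paper may well use an LLM for this step), and the example provision are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class StatuteVersion:
    """One historical version of a provision (hypothetical schema)."""
    provision: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still in force
    text: str


# Assumption: the fact date appears in German DD.MM.YYYY notation.
_DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def extract_fact_date(question: str) -> Optional[date]:
    """Return the first DD.MM.YYYY date in the question, or None."""
    match = _DATE_RE.search(question)
    if match is None:
        return None
    day, month, year = (int(g) for g in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 31.02.2020 is not a real calendar date
        return None


def filter_by_fact_date(
    versions: List[StatuteVersion], fact_date: date
) -> List[StatuteVersion]:
    """Keep only versions in force on fact_date (inclusive bounds)."""
    return [
        v
        for v in versions
        if v.valid_from <= fact_date
        and (v.valid_to is None or fact_date <= v.valid_to)
    ]


# Hypothetical version history for a placeholder provision.
old = StatuteVersion("§ X", date(2015, 1, 1), date(2021, 12, 31), "old wording")
new = StatuteVersion("§ X", date(2022, 1, 1), None, "amended wording")

question = "The contract was terminated on 15.03.2019. Which rule applies?"
fact_date = extract_fact_date(question)
candidates = filter_by_fact_date([old, new], fact_date) if fact_date else []
```

In this sketch the filter runs before any text reaches the generator, so a version that was not in force on the fact date cannot be cited at all. That is the sense in which the constraint is "hard": it is enforced in retrieval rather than requested in the prompt.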

Problem

Research questions and friction points this paper is trying to address.

temporal failure modes

statutory question answering

large language models

legal QA

time-sensitive reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal failure modes

retrieval-augmented generation

statutory question answering