Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via Bagging and SVR Ensembles

📅 2025-01-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Retrieving relevant documents from German legal corpora is difficult because of domain-specific language and sparse relevance signals. Method: This paper proposes a lightweight recall method for German legal document retrieval that requires no training or fine-tuning of deep learning models. It frames retrieval as multiple “needle-in-a-haystack” binary classification subtasks within a pre-trained text embedding space, and combines Support Vector Regression (SVR) ensembles with bootstrap aggregation (bagging) and a voting mechanism. Contribution/Results: Evaluated on the GerDaLIR benchmark, the voting ensemble reaches a recall of 0.849, above the two baselines (0.803 and 0.829). The authors describe these as promising initial results and expect further gains from refining the encoding models and optimizing hyperparameters.

📝 Abstract

We introduce a retrieval approach leveraging Support Vector Regression (SVR) ensembles, bootstrap aggregation (bagging), and embedding spaces on the German Dataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the retrieval task in terms of multiple binary needle-in-a-haystack subtasks, we show improved recall over the baselines (0.849 versus 0.803 and 0.829) using our voting ensemble. These are promising initial results, obtained without training or fine-tuning any deep learning models. Our approach holds potential for further enhancement, particularly through refining the encoding models and optimizing hyperparameters.
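The abstract does not spell out the procedure, so the following is only a minimal sketch of how a bagged SVR voting ensemble over embeddings could look, not the authors' implementation. Everything specific here is an assumption made for illustration: the query embedding is treated as the single positive "needle" (target 1), bootstrap samples of document embeddings serve as the "haystack" (target 0), the SVR is linear and fitted by plain subgradient descent, each model votes for its top-k documents, and the 2-D vectors are toy data. The names `fit_linear_svr`, `predict`, and `rank_by_votes` are hypothetical.

```python
import random

def fit_linear_svr(X, y, epsilon=0.1, lam=0.01, lr=0.05, epochs=100, rng=None):
    """Linear epsilon-insensitive SVR fitted by subgradient descent (illustrative only)."""
    rng = rng or random.Random(0)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
            # Subgradient of the epsilon-insensitive loss: zero inside the tube.
            g = 0.0 if abs(err) <= epsilon else (1.0 if err > 0 else -1.0)
            w = [wj - lr * (lam * wj + g * xj) for wj, xj in zip(w, X[i])]
            b -= lr * g
    return w, b

def predict(model, x):
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def rank_by_votes(query_vec, doc_vecs, n_models=10, bag_size=4, top_k=2, seed=0):
    """Rank documents by how often bagged SVR models place them in their top-k."""
    rng = random.Random(seed)
    n_docs = len(doc_vecs)
    votes = [0] * n_docs
    score_sums = [0.0] * n_docs
    for _ in range(n_models):
        # Assumed setup: bootstrap sample (with replacement) of haystack documents.
        bag = [rng.randrange(n_docs) for _ in range(bag_size)]
        X = [query_vec] + [doc_vecs[i] for i in bag]
        y = [1.0] + [0.0] * bag_size  # query is the needle, sampled docs are hay
        model = fit_linear_svr(X, y, rng=rng)
        scores = [predict(model, d) for d in doc_vecs]
        for i, s in enumerate(scores):
            score_sums[i] += s
        # Binary vote: a document is "relevant" for this model if it is in the top-k.
        for i in sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:top_k]:
            votes[i] += 1
    # Order by vote count, breaking ties with the summed regression score.
    ranking = sorted(range(n_docs), key=lambda i: (votes[i], score_sums[i]), reverse=True)
    return ranking, votes

# Toy 2-D "embeddings" (made up for this sketch).
QUERY = [1.0, 0.0]
DOCS = [[0.9, 0.1], [0.1, 0.9], [-0.8, 0.3], [0.0, -1.0], [-0.5, -0.5], [0.2, 0.8]]

if __name__ == "__main__":
    ranking, votes = rank_by_votes(QUERY, DOCS)
    print("ranking:", ranking)
    print("votes:  ", votes)
```

In practice the toy vectors would be replaced by embeddings from a pre-trained text encoder, and a library SVR implementation would likely be used in place of the hand-written fit. The number of models, bag size, and vote threshold are exactly the kind of hyperparameters the abstract names as candidates for further tuning.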

Problem

Research questions and friction points this paper is trying to address.

Information Retrieval

Legal Data Sets

Precision and Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Support Vector Regression

Data Resampling Techniques

Enhanced Information Retrieval

🔎 Similar Papers

Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval