🤖 AI Summary
Problem: Retrieving relevant documents from German legal corpora is challenging because of domain-specific language and sparse relevance signals.
Method: This paper proposes a lightweight recall method for German legal document retrieval that requires no training or fine-tuning of deep learning models. It frames retrieval as multiple binary “needle-in-a-haystack” classification subtasks within a pre-trained text embedding space. Documents are ranked by bagged (bootstrap-aggregated) Support Vector Regression (SVR) ensembles whose members are combined through a binary classification voting mechanism.
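One possible reading of this method can be sketched in a few lines. The summary does not give the exact construction, so everything below is an assumption made for illustration: the function names, the choice of the query embedding as the single positive example, the linear kernel, and all defaults are hypothetical. The sketch uses NumPy and scikit-learn's `SVR`, and takes the document and query embeddings as given.

```python
import numpy as np
from sklearn.svm import SVR


def unit_rows(x):
    """Scale each row to unit length so dot products act like cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def bagged_svr_vote_ranking(query_vec, doc_vecs, n_models=15, n_negatives=20,
                            top_k=5, seed=0):
    """Rank documents for one query with a bagged SVR voting ensemble.

    Hypothetical reading of the method: every name and default here is an
    assumption for illustration, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_docs = doc_vecs.shape[0]
    votes = np.zeros(n_docs, dtype=int)
    score_sum = np.zeros(n_docs)

    for _ in range(n_models):
        # Bagging: draw a bootstrap sample (with replacement) of the haystack.
        neg_idx = rng.integers(0, n_docs, size=n_negatives)
        # Binary subtask: the query is the lone positive ("needle", target 1),
        # the sampled documents are treated as negatives (target 0).
        X = np.vstack([query_vec[None, :], doc_vecs[neg_idx]])
        y = np.concatenate([[1.0], np.zeros(n_negatives)])

        model = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
        scores = model.predict(doc_vecs)

        # Binary vote: this ensemble member marks its top_k documents as relevant.
        votes[np.argsort(-scores)[:top_k]] += 1
        score_sum += scores

    # Sort by votes (descending), breaking ties by summed score (descending).
    order = np.lexsort((-score_sum, -votes))
    return order, votes
```

Calling `bagged_svr_vote_ranking(query, docs)` with unit-normalized embeddings returns document indices ordered by vote count, plus the raw votes. Only the small SVR models are fitted per query; the embedding model itself is never updated, which is the sense in which no deep learning model is trained.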
Contribution/Results: The approach avoids fine-tuning large deep learning models while improving recall for sparsely relevant documents. On the GerDaLIR benchmark it achieves a recall of 0.849, compared with 0.803 and 0.829 for the baselines. The result suggests that non-deep-learning methods built on pre-trained embeddings can be effective and practical for specialized-domain retrieval.
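For reference, recall measures the fraction of a query's relevant documents that appear in the retrieved list. The sketch below assumes a cut-off `k` and averages over queries; the summary specifies neither the cut-off nor the averaging scheme, so both are illustrative assumptions, as are the function names.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top k of a ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined for a query with no relevant documents")
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall over (ranked_ids, relevant_ids) pairs, one pair per query."""
    values = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(values) / len(values)
```

When relevance is sparse and a query has only one relevant document, its recall is either 0 or 1, so the averaged figure is the share of queries whose needle was found within the cut-off.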
📝 Abstract
We introduce a retrieval approach leveraging Support Vector Regression (SVR) ensembles, bootstrap aggregation (bagging), and embedding spaces on the German Dataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the retrieval task as multiple binary needle-in-a-haystack subtasks, our voting ensemble improves recall over the baselines (0.849 versus 0.803 and 0.829) without training or fine-tuning any deep learning models, a promising initial result. Our approach holds potential for further enhancement, particularly through refining the encoding models and optimizing hyperparameters.