Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining

📅 2025-07-19

🤖 AI Summary
To address the domain-knowledge gap and low retrieval accuracy of large language models in Vietnamese legal document retrieval, this paper proposes a two-stage retrieval framework: (1) an efficient candidate retrieval stage using a fine-tuned Bi-Encoder, and (2) a fine-grained re-ranking stage employing a Cross-Encoder. Key innovations include semi-hard negative mining and fine-grained negative sampling, a novel Exist@m evaluation metric, and a customized loss function to mitigate training bias and enhance robustness. The lightweight, single-pass architecture placed in the top three of the SoICT Hackathon 2024 legal retrieval task, offering a competitive alternative to complex ensemble models with far fewer parameters. This demonstrates the method's effectiveness and practicality for specialized legal domains in low-resource languages.

📝 Abstract
Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alternative with far fewer parameters. The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.
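
The retrieve-then-re-rank flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`retrieve`, `rerank`, `exist_at_m`), the dot-product scoring, and the toy pair scorer are all hypothetical. The abstract does not define Exist@m, so the sketch assumes it means "at least one relevant document appears among the top m results".

```python
from typing import Callable, Dict, List, Sequence, Set

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product used here as a stand-in for Bi-Encoder similarity."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, doc_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: rank documents by embedding similarity and keep the top k."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: List[str],
           pair_score: Callable[[str, str], float]) -> List[str]:
    """Stage 2: re-order candidates with a (query, document) pair scorer,
    which stands in for a Cross-Encoder."""
    return sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)


def exist_at_m(ranked_ids: Sequence[str], relevant_ids: Set[str], m: int) -> int:
    """ASSUMED definition: 1 if any relevant id is in the top m, else 0."""
    return int(any(d in relevant_ids for d in ranked_ids[:m]))


# Toy data: precomputed embeddings for three documents (hypothetical values).
doc_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
candidates = retrieve([1.0, 0.0], doc_vecs, k=2)

# Toy pair scorer that prefers document "b" (hypothetical).
reranked = rerank("query text", candidates, lambda q, d: 1.0 if d == "b" else 0.0)

# Did stage 1 surface the relevant document for the re-ranker to work with?
hit = exist_at_m(candidates, {"b"}, m=2)
```

Under this assumed reading, Exist@m measures whether the first stage gives the re-ranker anything useful to work with: if no relevant document survives candidate retrieval, no amount of re-ranking can recover it.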

Problem

Research questions and friction points this paper is trying to address.

Enhancing legal document retrieval efficiency and accuracy
Mitigating training bias with semi-hard negative mining
Optimizing lightweight models for legal domain precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned Bi-Encoder for rapid retrieval
Cross-Encoder for precise re-ranking
Semi-hard negative mining to reduce bias
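
As a rough illustration of semi-hard negative mining, the sketch below selects negatives whose retriever score falls just below the positive's score, within a margin. The page does not give the paper's exact selection rule, so this band definition (borrowed from the common triplet-loss convention), the function name `mine_semi_hard_negatives`, and the `margin` and `max_negatives` parameters are assumptions made for illustration.

```python
from typing import Dict, List


def mine_semi_hard_negatives(scores: Dict[str, float], positive_id: str,
                             margin: float, max_negatives: int) -> List[str]:
    """Pick negatives scoring below the positive but within `margin` of it.

    ASSUMED rule: pos_score - margin < neg_score < pos_score.
    Negatives scoring above the positive are skipped (they are the hardest
    cases and may be unlabeled positives); negatives far below it are
    skipped as too easy to be informative.
    """
    pos_score = scores[positive_id]
    band = [
        (doc_id, s)
        for doc_id, s in scores.items()
        if doc_id != positive_id and pos_score - margin < s < pos_score
    ]
    # Prefer the negatives closest to the positive within the band.
    band.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in band[:max_negatives]]


# Toy retriever scores for one query (hypothetical values).
scores = {"pos": 0.80, "n1": 0.95, "n2": 0.75, "n3": 0.65, "n4": 0.20}
negatives = mine_semi_hard_negatives(scores, "pos", margin=0.2, max_negatives=2)
```

Here `n1` is excluded because it outscores the positive, and `n4` because it is too easy, leaving `n2` and `n3` as training negatives for the re-ranker.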