Pralekha: An Indic Document Alignment Evaluation Benchmark

📅 2024-11-28

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing cross-lingual document alignment (CLDA) methods rely on metadata or pooled sentence embeddings, which limits their ability to model fine-grained alignments and exposes them to context-window constraints, a particular problem for low-resource Indic languages. This paper introduces Pralekha, a large-scale benchmark for evaluating document-level alignment, covering 11 Indic languages and English with over 2 million documents. Using the benchmark, the authors evaluate document-level mining approaches along three dimensions: the embedding model, the granularity level, and the alignment algorithm. They also propose the Document Alignment Coefficient (DAC), a scoring method that aligns documents through sentence- and chunk-level alignments rather than a single pooled embedding. DAC achieves average gains of 20–30% in precision and 15–20% in F1 score over baseline pooling approaches, particularly in noisy scenarios. Pralekha thereby provides a common evaluation benchmark for document-level parallel corpus mining in Indic languages.

📝 Abstract

Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.
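
The precision and F1 figures quoted above are the standard way to score mined document pairs against gold alignments. As a minimal sketch, assuming mined and gold pairs are represented as sets of `(source_id, target_id)` tuples, the computation could look like the following. The function name `pair_metrics` and the document IDs are hypothetical and do not come from the paper.

```python
def pair_metrics(predicted, gold):
    """Return (precision, recall, F1) of mined pairs against gold aligned pairs."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)  # mined pairs that are truly aligned
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical document IDs, for illustration only.
gold = {("hi_1", "en_1"), ("hi_2", "en_2"), ("hi_3", "en_3")}
mined = {("hi_1", "en_1"), ("hi_2", "en_9"), ("hi_3", "en_3"), ("hi_4", "en_4")}

precision, recall, f1 = pair_metrics(mined, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Unaligned documents in the pool show up as spurious mined pairs, so they lower precision directly. That is why noisy settings separate scoring methods most clearly on this measure.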

Problem

Research questions and friction points this paper is trying to address.

Challenges in cross-lingual document alignment for Indic languages

Limitations of metadata and sentence embedding methods for alignment

Need for effective document-level representation and alignment metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Pralekha benchmark for Indic languages

Proposes the Document Alignment Coefficient (DAC) scoring method

Aligns documents by matching smaller chunks
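
The contrast between pooling and chunk matching can be illustrated with a toy example. This page does not give DAC's formula, so the sketch below is not the authors' definition: it assumes a Dice-style ratio of matched chunks to total chunks, greedy one-to-one matching, and a cosine threshold of 0.8. The function names and the 2-D "embeddings" are hypothetical stand-ins for real multilingual sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pooled_score(src_chunks, tgt_chunks):
    """Baseline: mean-pool each document's chunk embeddings, then compare."""
    def mean_pool(chunks):
        return [sum(col) / len(chunks) for col in zip(*chunks)]
    return cosine(mean_pool(src_chunks), mean_pool(tgt_chunks))


def chunk_match_score(src_chunks, tgt_chunks, threshold=0.8):
    """Assumed DAC-style score: 2 * matched / (n_src + n_tgt).

    Chunks are matched greedily, one-to-one, in order of decreasing
    similarity; pairs below `threshold` (an assumed value) never match.
    """
    candidates = sorted(
        ((cosine(s, t), i, j)
         for i, s in enumerate(src_chunks)
         for j, t in enumerate(tgt_chunks)),
        reverse=True,
    )
    used_src, used_tgt, matched = set(), set(), 0
    for sim, i, j in candidates:
        if sim < threshold:
            break  # remaining candidates are even less similar
        if i in used_src or j in used_tgt:
            continue
        used_src.add(i)
        used_tgt.add(j)
        matched += 1
    total = len(src_chunks) + len(tgt_chunks)
    return 2 * matched / total if total else 0.0


# Toy 2-D chunk embeddings (hypothetical).
src_doc = [[1.0, 0.0], [0.0, 1.0]]
aligned_doc = [[0.99, 0.05], [0.05, 0.99]]   # chunk-by-chunk translation
unrelated_doc = [[0.7, 0.7], [0.7, 0.7]]     # same average, different chunks

pooled_aligned = pooled_score(src_doc, aligned_doc)
pooled_unrelated = pooled_score(src_doc, unrelated_doc)
match_aligned = chunk_match_score(src_doc, aligned_doc)
match_unrelated = chunk_match_score(src_doc, unrelated_doc)
```

In this toy setup the pooled score is close to 1 for both target documents, because averaging erases which chunks carried which content. The chunk-matching score separates them: every chunk of the aligned document finds a partner, while no chunk of the unrelated document clears the threshold.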
