🤖 AI Summary
Existing cross-lingual document alignment (CLDA) methods typically pool sentence embeddings, which limits their ability to model fine-grained alignments and leaves them constrained by short context windows. The problem is compounded for Indic languages, which lack high-quality benchmarks of parallel document pairs. This paper introduces Pralekha, a large-scale evaluation benchmark for document-level alignment that covers 11 Indic languages and English and contains over 2 million documents. Using Pralekha, the authors evaluate document-level mining approaches along three dimensions: the embedding model, the granularity level, and the alignment algorithm. They also propose the Document Alignment Coefficient (DAC), a scoring method that aligns documents from sentence- and chunk-level alignments rather than from a single pooled embedding. DAC achieves average gains of 20–30% in precision and 15–20% in F1 score over baseline pooling approaches, with the largest improvements in noisy scenarios. Pralekha thus provides a common benchmark for evaluating document-level parallel corpus mining for Indic languages.
📝 Abstract
Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.
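The abstract does not state how DAC is computed, so the following is only a minimal sketch of the general idea of scoring a document pair from chunk-level alignments instead of one pooled vector. Everything in it is an assumption made for illustration: the Dice-style ratio, the mutual-nearest-neighbour matching, the `threshold` parameter, and the function names `mutual_nearest_pairs`, `alignment_coefficient`, and `mean_pool_similarity` are hypothetical and are not taken from the paper.

```python
import numpy as np


def _normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mutual_nearest_pairs(src: np.ndarray, tgt: np.ndarray, threshold: float = 0.5):
    """Return (i, j) chunk pairs that are each other's nearest neighbour.

    src, tgt: arrays of shape (n_chunks, dim) holding chunk embeddings.
    threshold: hypothetical minimum cosine similarity for a pair to count.
    """
    sim = _normalize(src) @ _normalize(tgt).T
    best_tgt = sim.argmax(axis=1)  # nearest target chunk for each source chunk
    best_src = sim.argmax(axis=0)  # nearest source chunk for each target chunk
    return [
        (i, int(j))
        for i, j in enumerate(best_tgt)
        if best_src[j] == i and sim[i, j] >= threshold
    ]


def alignment_coefficient(src: np.ndarray, tgt: np.ndarray, threshold: float = 0.5) -> float:
    """Assumed Dice-style score: 2 * aligned chunks / total chunks in both documents."""
    total = len(src) + len(tgt)
    if total == 0:
        return 0.0
    return 2 * len(mutual_nearest_pairs(src, tgt, threshold)) / total


def mean_pool_similarity(src: np.ndarray, tgt: np.ndarray) -> float:
    """Pooling baseline: cosine similarity of the two mean chunk embeddings."""
    a, b = src.mean(axis=0), tgt.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy example: a 3-chunk source document and a 2-chunk target document.
src_doc = np.eye(3)
tgt_doc = np.eye(3)[:2]
score = alignment_coefficient(src_doc, tgt_doc)
```

In this toy case two of the three source chunks find a mutual match, so the score reflects both the matched and the unmatched chunk. A single pooled similarity cannot show which chunks matched, and that chunk-level view is the kind of fine-grained signal the abstract attributes to DAC.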