Beyond Ranked Lists: The SARAL Framework for Cross-Lingual Document Set Retrieval

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current cross-lingual information retrieval (CLIR) systems typically return only ranked document lists and do not support retrieving semantically coherent, topically complete document sets. To address this limitation, the authors propose SARAL, an end-to-end multilingual document set retrieval framework. SARAL integrates machine translation, domain-adaptive retrieval, and summarization to shift CLIR from isolated document ranking toward returning cohesive, query-relevant document collections. Its core idea is to model semantic matching between a query and a document set jointly, rather than assessing each document's relevance in isolation. In Phase 3 of the IARPA MATERIAL program, SARAL outperformed the other teams in five of six evaluation conditions spanning Farsi (Persian), Kazakh, and Georgian.

📝 Abstract
Machine Translation for English Retrieval of Information in Any Language (MATERIAL) is an IARPA initiative aimed at advancing the state of cross-lingual information retrieval (CLIR). This report provides a detailed description of the Information Sciences Institute's (ISI's) Summarization and domain-Adaptive Retrieval Across Languages (SARAL) effort for MATERIAL. Specifically, we outline our team's novel approach to CLIR, with an emphasis on retrieving a query-relevant document set, and not just a ranked document list. In MATERIAL's Phase-3 evaluations, SARAL exceeded the performance of other teams in five out of six evaluation conditions spanning three different languages (Farsi, Kazakh, and Georgian).
Problem

Research questions and friction points this paper is trying to address.

Develop cross-lingual retrieval for document sets
Move beyond ranked lists to query-relevant sets
Handle multiple languages, including Farsi, Kazakh, and Georgian
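
The difference between a ranked list and a document set can be illustrated with a minimal Python sketch. This is not SARAL's method: the document ids, the scores, and the fixed threshold are all hypothetical, and a score cutoff is only one simple way to turn a ranking into a set.

```python
from typing import Dict, List, Set


def ranked_list(scores: Dict[str, float]) -> List[str]:
    """Return every document id, ordered by descending relevance score."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def relevant_set(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Return only the document ids whose score meets the cutoff.

    The fixed threshold is an assumption made for illustration.
    """
    return {doc_id for doc_id, score in scores.items() if score >= threshold}


# Hypothetical query-document relevance scores.
scores = {"doc_a": 0.91, "doc_b": 0.15, "doc_c": 0.64, "doc_d": 0.08}

ranking = ranked_list(scores)          # all four documents, best first
selected = relevant_set(scores, 0.5)   # only the documents judged relevant
```

A ranked list leaves the user to decide where to stop reading, whereas a set commits the system to an explicit decision about which documents are relevant to the query.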
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieves query-relevant document sets
Uses summarization and domain-adaptive retrieval across languages
Outperformed other teams in five of six MATERIAL Phase-3 evaluation conditions
Shantanu Agarwal
Information Sciences Institute, University of Southern California
Joel Barry
Information Sciences Institute, University of Southern California
Elizabeth Boschee
Information Sciences Institute, University of Southern California
Scott Miller
Professor of Electrical and Computer Engineering, Texas A&M University
Communication Theory