Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

📅 2025-11-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the scarcity of high-quality multi-document summarization (MDS) data for low-resource languages, this paper proposes an automated method for constructing MDS datasets from historical newspaper front-page teasers, which are naturally occurring, editor-written summaries. The approach combines layout analysis and NLP techniques to identify, extract, and refine these teasers into reference summaries. The authors validate the teaser phenomenon empirically across seven languages, which supports its cross-lingual applicability. As a key outcome, they introduce HEBTEASESUM, the first human-verified, multi-source MDS dataset for Hebrew, comprising thousands of real-world news documents paired with expert-curated summaries. Experiments indicate that the pipeline is efficient and scalable, and that it substantially lowers the barrier to MDS data creation for low-resource languages. The work offers a reusable approach for building summarization resources in other under-represented languages.

📝 Abstract

High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.

Problem

Research questions and friction points this paper is trying to address.

Collecting summarization data for low-resource languages using digitized newspapers

Automating extraction of multi-document summaries from front-page newspaper teasers

Creating the first Hebrew multi-document summarization dataset from historical newspapers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses front-page teasers as natural summaries

Automates data collection for varying resource levels

Creates the first Hebrew multi-document summarization dataset
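
To make the idea of a teaser acting as a multi-document summary more concrete, the following is a minimal sketch of how one teaser might be paired with the full-length articles it covers. It is not the paper's pipeline: the names `Article`, `TeaserExample`, `coverage`, and `link_teaser`, the token-overlap heuristic, and the 0.5 threshold are all assumptions chosen for illustration. A real system would presumably start from layout analysis of scanned pages and use stronger matching.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Article:
    article_id: str  # hypothetical identifier for an inner-page article
    text: str


@dataclass
class TeaserExample:
    teaser: str  # front-page teaser, treated as the reference summary
    sources: list[Article] = field(default_factory=list)  # articles it summarizes


def tokens(text: str) -> set[str]:
    # \w is Unicode-aware for str patterns, so this also splits Hebrew text.
    return set(re.findall(r"\w+", text.lower()))


def coverage(teaser: str, article: str) -> float:
    # Fraction of the teaser's distinct tokens that also occur in the article.
    teaser_tokens = tokens(teaser)
    if not teaser_tokens:
        return 0.0
    return len(teaser_tokens & tokens(article)) / len(teaser_tokens)


def link_teaser(teaser: str, articles: list[Article],
                threshold: float = 0.5) -> TeaserExample:
    # Assumed heuristic: keep every article whose coverage meets the threshold.
    matched = [a for a in articles if coverage(teaser, a.text) >= threshold]
    return TeaserExample(teaser=teaser, sources=matched)


if __name__ == "__main__":
    issue = [
        Article("a1", "The city council approved the new budget on Monday after a long debate"),
        Article("a2", "Council members debate the budget and approved new taxes for the city"),
        Article("a3", "Local football team wins the championship"),
    ]
    example = link_teaser("City council approved new budget", issue)
    print([a.article_id for a in example.sources])
```

A teaser linked to more than one article yields a multi-document example, while a teaser linked to exactly one yields a single-document example. The threshold trades recall for precision and would need tuning per language and per newspaper.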
