Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality multi-document summarization (MDS) data for low-resource languages, this paper proposes an automated method for constructing MDS datasets from front-page teasers in historical newspapers, which are naturally occurring summaries written by editors. The approach combines layout analysis with NLP techniques to identify, extract, and refine these teasers into reference summaries. The teaser phenomenon is shown to occur across seven languages, which supports cross-lingual applicability. As a key outcome, the paper introduces HEBTEASESUM, the first human-verified, multi-source MDS dataset for Hebrew, comprising thousands of real-world news documents paired with expert-curated summaries. The pipeline is reported to be efficient and scalable, lowering the barrier to MDS data creation for low-resource languages and offering a new way to source multilingual summarization data.

📝 Abstract
High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.

Problem

Research questions and friction points this paper is trying to address.

Collecting summarization data for low-resource languages from digitized newspapers
Automating the extraction of multi-document summaries from front-page newspaper teasers
Creating the first Hebrew multi-document summarization dataset from historical newspapers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses front-page teasers as naturally occurring summaries
Automates data collection for languages with varying levels of linguistic resources
Creates the first Hebrew multi-document summarization dataset
Noam Dahan
The Hebrew University of Jerusalem
Omer Kidron
The Hebrew University of Jerusalem
Gabriel Stanovsky
The Hebrew University of Jerusalem
Computational Linguistics