🤖 AI Summary
To address the scarcity of high-quality multi-document summarization (MDS) data for low-resource languages, this paper proposes an automated method for constructing MDS datasets from historical newspaper front-page teasers, which are naturally occurring, editor-written summaries. The approach combines layout analysis and NLP techniques to identify, extract, and refine these teasers into high-fidelity reference summaries. The teaser phenomenon is empirically validated across seven languages, supporting its cross-lingual applicability. As a key outcome, the paper introduces HEBTEASESUM, the first human-verified, multi-source MDS dataset for Hebrew, comprising thousands of real-world news documents paired with expert-curated summaries. Experiments demonstrate that the pipeline is efficient and scalable, and that it substantially lowers the barrier to MDS data creation for low-resource languages. The work offers a new paradigm for multilingual summarization research and provides infrastructure for further work in the field.
📝 Abstract
High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, in which editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying levels of linguistic resources. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.