Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

📅 2026-05-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the difficulty of obtaining large-scale news corpora: commercial archives are costly and restrictively licensed, and open datasets such as CC-News demand substantial storage and processing. The authors extract and clean the article text and parse the metadata of the entire CC-News archive from August 2016 onward, run language identification over more than 1.35 billion articles with GlotLID, lingua, and CommonLingua, and add multi-source geolocation annotations that resolve a country of origin for 83.4% of articles across 222 countries. Using suffix arrays, they build Infini-gram indexes that support sub-second full-text pattern matching for arbitrary queries, lowering the barrier to cross-national, longitudinal media research.
📝 Abstract
Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.
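To make the third contribution concrete, here is a minimal sketch of how a suffix array supports arbitrary pattern search. It is a toy illustration, not the Infini-News or Infini-gram implementation: it works on characters of a small in-memory string and builds the array naively, and the function names `build_suffix_array` and `find_occurrences` are hypothetical. Production indexes are typically built over tokenized text with far more efficient construction and on-disk storage.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start offsets of all suffixes of `text`, sorted lexicographically.

    Naive construction for illustration only; it compares whole suffixes and
    would not scale to a real corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _boundary(text: str, sa: list[int], pattern: str, upper: bool) -> int:
    """Binary search for the first suffix-array slot whose length-m prefix is
    >= pattern (upper=False) or > pattern (upper=True)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return every offset in `text` where `pattern` starts, in ascending order.

    All suffixes that begin with `pattern` sit in one contiguous block of the
    suffix array, so two binary searches locate the whole block.
    """
    start = _boundary(text, sa, pattern, upper=False)
    end = _boundary(text, sa, pattern, upper=True)
    return sorted(sa[start:end])


if __name__ == "__main__":
    corpus = "the news cycle renews the news"
    sa = build_suffix_array(corpus)
    print(find_occurrences(corpus, sa, "news"))
```

Because each lookup is two binary searches over the sorted suffixes, the number of comparisons grows logarithmically with corpus size, which is the property that makes sub-second queries over a very large archive plausible.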
Problem

Research questions and friction points this paper is trying to address.

news corpus
data access
computational social science
NLP
Common Crawl
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infini-gram
language detection
geographic attribution
suffix-array indexing
CC-News processing