Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the difficulty of acquiring large-scale news corpora: commercial archives are expensive, and open alternatives such as CC-News demand substantial storage and processing. The authors extract and clean article text and parse structured metadata from the entire CC-News archive, which dates back to August 2016. They run language identification on 1.35 billion news articles using GlotLID, lingua, and CommonLingua, and add multi-source geolocation annotations that resolve a country of origin for 83.4% of the articles across 222 countries. Using suffix arrays, they build Infini-gram indexes that support sub-second full-text matching of arbitrary patterns, substantially lowering the barrier to cross-national, longitudinal media research.

📝 Abstract

Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract and clean the text and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.
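
To make the suffix-array idea in the abstract concrete, the following is a minimal character-level sketch of pattern lookup by binary search over sorted suffixes. It is a toy under stated assumptions, not the Infini-News or Infini-gram implementation: the function names (build_suffix_array, find_occurrences) are hypothetical, the construction is deliberately naive, and a production index would likely operate over tokenized text with arrays stored on disk.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction that copies every suffix; acceptable only for toy inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text: str, sa: list[int], pattern: str, upper: bool) -> int:
    """Binary search over `sa` for the first suffix whose length-m prefix is
    >= pattern (upper=False) or > pattern (upper=True), where m = len(pattern)."""
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start offsets of every occurrence of `pattern` in `text`."""
    start = _bound(text, sa, pattern, upper=False)
    end = _bound(text, sa, pattern, upper=True)
    # All suffixes in sa[start:end] begin with `pattern`.
    return sorted(sa[start:end])


# Hypothetical toy document standing in for archive text.
doc = "the news about the news cycle"
sa = build_suffix_array(doc)
hits = find_occurrences(doc, sa, "news")
print(hits)  # [4, 19]
```

Because the suffixes are sorted, every occurrence of a pattern sits in one contiguous range of the array, so a lookup needs only two binary searches regardless of how often the pattern occurs. That property is what makes queries over arbitrary text patterns fast once the index has been built.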

Problem

Research questions and friction points this paper is trying to address.

news corpus

data access

computational social science

NLP

Common Crawl

Innovation

Methods, ideas, or system contributions that make the work stand out.

Infini-gram

language detection

geographic attribution