From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

career value

159K/year

🤖 AI Summary

This study addresses the challenge of efficiently and accurately identifying neologisms in massive text corpora by proposing the first neologism definition framework that integrates grammatical and supra-grammatical morphological theories, along with an end-to-end scalable detection pipeline. The approach combines rule-based filtering, ensemble classification via multiple large language models, and expert human validation to identify 1,021 candidate neologisms from 124.6 million unique tokens in a corpus of 52.7 billion Reddit posts, ultimately confirming 599 genuine neologisms with a precision of 58.7%. Key contributions include a novel four-category typology for neologisms, a high-precision methodology for large-scale corpus analysis, and the public release of all code, scripts, and annotated datasets.

📝 Abstract

We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English-language Reddit posts spanning 2005-2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate via majority vote, with a final verification step, revealing substantial cross-model disagreement and highlighting the challenge of operationalizing neologism detection at scale. Manual annotation of all 1,021 candidates confirms that 599 (58.7%) are genuine lexical innovations. The pipeline code, vocabulary compilation scripts, and the annotated candidate list are available at https://github.com/DiegoRossini/neologism-pipeline.

Problem

Research questions and friction points this paper is trying to address.

neologism detection

lexical innovation

large-scale text analysis

automatic word discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

neologism detection

large language models

morphological frameworks

scalable pipeline