From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection

πŸ“… 2026-05-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF

career value

187K/year
πŸ€– AI Summary
This study addresses the challenge of efficiently and accurately identifying neologisms in massive text corpora by proposing the first neologism definition framework that integrates grammatical and supra-grammatical morphological theories, along with an end-to-end scalable detection pipeline. The approach combines rule-based filtering, ensemble classification via multiple large language models, and expert human validation to identify 1,021 candidate neologisms from 124.6 million unique tokens in a corpus of 52.7 billion Reddit posts, ultimately confirming 599 genuine neologisms with a precision of 58.7%. Key contributions include a novel four-category typology for neologisms, a high-precision methodology for large-scale corpus analysis, and the public release of all code, scripts, and annotated datasets.
πŸ“ Abstract
We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English-language Reddit posts spanning 2005-2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate via majority vote, with a final verification step, revealing substantial cross-model disagreement and highlighting the challenge of operationalizing neologism detection at scale. Manual annotation of all 1,021 candidates confirms that 599 (58.7%) are genuine lexical innovations. The pipeline code, vocabulary compilation scripts, and the annotated candidate list are available at https://github.com/DiegoRossini/neologism-pipeline.
Problem

Research questions and friction points this paper is trying to address.

neologism detection
lexical innovation
large-scale text analysis
automatic word discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

neologism detection
large language models
morphological frameworks
scalable pipeline
lexical innovation
πŸ”Ž Similar Papers
No similar papers found.