🤖 AI Summary
Traditional topic modeling treats outlier words as noise, overlooking their potential as early indicators of emerging themes—particularly challenging in dynamic news corpora for timely trend detection.
Method: We propose a novel paradigm that reinterprets outlier words as weak semantic signals of nascent topics. To systematically track their evolution into coherent themes, we integrate language model embeddings (BERT, mBERT, XLM-R) with incremental cumulative clustering, enabling temporal tracing of semantic drift and theme formation.
Contribution/Results: Evaluated on bilingual English–French news data centered on corporate social responsibility and climate change, our approach demonstrates cross-lingual and cross-model robustness. Crucially, we provide the first empirical evidence of systematic outlier-word thematic evolution and achieve prospective identification of emerging topics—averaging 2.3 weeks ahead of conventional baselines. The framework delivers an interpretable, reusable computational pipeline for early trend detection and actionable insight generation.
📝 Abstract
This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track their evolution over time in French and English news datasets focused on corporate social responsibility and climate change. The results reveal a consistent pattern: outliers tend to evolve into coherent topics over time across both models and languages.