From Outliers to Topics in Language Models: Anticipating Trends in News Corpora

📅 2025-09-26
🤖 AI Summary
Problem: Traditional topic modeling treats outlier words as noise and overlooks their potential as early indicators of emerging themes, a limitation that matters most in dynamic news corpora, where trends need to be detected quickly. Method: We propose reinterpreting outlier words as weak semantic signals of nascent topics. To track how they evolve into coherent themes, we combine language model embeddings (BERT, mBERT, XLM-R) with incremental cumulative clustering, which makes it possible to trace semantic drift and theme formation over time. Contribution/Results: Evaluated on bilingual English–French news data centered on corporate social responsibility and climate change, the approach is robust across languages and models. We provide the first empirical evidence that outlier words systematically evolve into themes, and we identify emerging topics an average of 2.3 weeks ahead of conventional baselines. The framework offers an interpretable, reusable computational pipeline for early trend detection.

📝 Abstract
This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track their evolution over time in French and English news datasets focused on corporate social responsibility and climate change. The results reveal a consistent pattern: outliers tend to evolve into coherent topics over time across both models and languages.

Problem

Research questions and friction points this paper is trying to address.

Analyzing outliers as weak signals of emerging topics
Tracking outlier evolution in multilingual news corpora
Investigating outlier-to-topic transformation patterns across languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using vector embeddings from language models
Applying cumulative clustering to track evolution
Identifying outliers as weak signals for topics
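
The three points above can be illustrated with a small sketch. It is not the authors' pipeline: the page does not name the clustering algorithm, so a minimal DBSCAN-style routine stands in for it, and hand-picked 2-D vectors stand in for language-model embeddings. The function names (`dbscan`, `track_outliers`, `outliers_turned_topics`) and the parameters `eps` and `min_pts` are hypothetical choices made for this example.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN stand-in: one label per point, -1 marks an outlier."""
    n = len(points)
    # Neighbourhood of each point (itself included) within radius eps.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former outlier absorbed as border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


def track_outliers(batches, eps=1.0, min_pts=3):
    """Re-cluster the cumulative corpus after each batch and record every
    document's label at each time step since it entered the corpus."""
    pool, history = [], {}
    for step, batch in enumerate(batches):
        pool.extend(batch)
        labels = dbscan([vec for _, vec in pool], eps, min_pts)
        for (doc_id, _), label in zip(pool, labels):
            history.setdefault(doc_id, []).append((step, label))
    return history


def outliers_turned_topics(history):
    """Map each document that started as an outlier and later joined a
    cluster to (step it first appeared, step it first joined a cluster)."""
    result = {}
    for doc_id, trail in history.items():
        first_step, first_label = trail[0]
        if first_label != -1:
            continue
        for step, label in trail[1:]:
            if label != -1:
                result[doc_id] = (first_step, step)
                break
    return result


# Toy data (assumed): each batch is one time slice of (doc_id, vector) pairs.
batches = [
    [("a1", (0.0, 0.0)), ("a2", (0.5, 0.0)), ("a3", (0.0, 0.5)),
     ("x1", (5.0, 5.0))],                       # x1 is isolated
    [("a4", (0.4, 0.4)), ("x2", (5.3, 5.1))],   # x2 lands near x1
    [("x3", (5.1, 5.4))],                       # third neighbour arrives
]

history = track_outliers(batches)
emerging = outliers_turned_topics(history)
print(emerging)
```

On this toy data, `x1` and `x2` are labeled as outliers when they first appear and join a new cluster once `x3` arrives, which is the outlier-to-topic pattern the abstract describes. Re-clustering the whole pool at every step keeps the sketch short; an incremental variant would be needed for a corpus of realistic size.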
Evangelia Zve
Sorbonne University
topic modeling, social network analysis, disinformation
Benjamin Icard
LIP6, Sorbonne Université, CNRS
Disinformation Analysis, Artificial Intelligence, Logic, Computational Linguistics, Explainability
Alice Breton
LIP6, Sorbonne University, CNRS, France
Lila Sainero
Research Engineer, LIP6
Gauvain Bourgne
LIP6, Sorbonne University, CNRS, France
Jean-Gabriel Ganascia
LIP6, Sorbonne University, CNRS, France