Merging Embedded Topics with Optimal Transport for Online Topic Modeling on Data Streams

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Online topic modeling and evolution tracking for social media text streams must cope with non-stationary, high-volume data and abrupt semantic shifts.
Method: The paper proposes StreamETM, an end-to-end online method that integrates the Embedded Topic Model (ETM) with Unbalanced Optimal Transport (UOT). It fuses ETM parameters across batches with a differentiable, stable, and scalable UOT-based step, and couples this with a lightweight KL-divergence-based online change-point detection mechanism for unsupervised topic drift identification. The model is optimized via stochastic gradient descent, enabling continual learning and dynamic parameter updates.
Contribution/Results: StreamETM is reported to achieve state-of-the-art performance on both synthetic and real-world streaming datasets, outperforming baselines (including LDA, OLDA, and DynamicETM) in topic coherence, temporal responsiveness, and change-point recall. Its UOT-driven parameter fusion is described as the first application of unbalanced optimal transport to differentiable, scalable online topic model adaptation.

📝 Abstract
Topic modeling is a key component in unsupervised learning, employed to identify topics within a corpus of textual data. The rapid growth of social media generates an ever-growing volume of textual data daily, making online topic modeling methods essential for managing these data streams that continuously arrive over time. This paper introduces a novel approach to online topic modeling named StreamETM. This approach builds on the Embedded Topic Model (ETM) to handle data streams by merging models learned on consecutive partial document batches using unbalanced optimal transport. Additionally, an online change point detection algorithm is employed to identify shifts in topics over time, enabling the identification of significant changes in the dynamics of text streams. Numerical experiments on simulated and real-world data show StreamETM outperforming competitors.
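For readers unfamiliar with the ETM that StreamETM builds on: in the standard ETM formulation, each topic is a vector in the word-embedding space, and its distribution over the vocabulary is a softmax of the inner products between the topic embedding and every word embedding. The sketch below illustrates only that one step with NumPy. The names `alpha`, `rho`, and `topic_word_distributions`, and the random toy data, are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np

def topic_word_distributions(alpha, rho):
    """Return the (K, V) topic-word matrix of an ETM-style model.

    alpha: (K, D) topic embeddings (hypothetical name)
    rho:   (V, D) word embeddings  (hypothetical name)
    """
    logits = alpha @ rho.T                       # (K, V) inner products
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax

# Toy data: 4 topics, 50 vocabulary words, 8-dimensional embeddings.
rng = np.random.default_rng(0)
rho = rng.normal(size=(50, 8))
alpha = rng.normal(size=(4, 8))
beta = topic_word_distributions(alpha, rho)
```

Because topics live in the same space as words, two models trained on different batches can be compared topic by topic through their embeddings, which is what makes a transport-based merge natural.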
Problem

Research questions and friction points this paper is trying to address.

Develops online topic modeling for continuous data streams
Merges topic models using unbalanced optimal transport
Detects topic shifts over time via change point detection
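The merging step listed above can be illustrated with a small, self-contained sketch. It uses the standard entropic unbalanced Sinkhorn scaling iterations with KL-relaxed marginals to compute a transport plan between the topic embeddings of two consecutive batches. The merge rule that follows (average a new topic into its best-matching old topic when enough mass is transported, otherwise keep it as a newly born topic) is an assumption chosen for illustration, not the paper's actual procedure. All names and defaults (`merge_topics`, `match_ratio`, `eps`, `reg_m`) are hypothetical.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.05, reg_m=0.1, n_iter=500):
    """Entropic unbalanced OT with KL-relaxed marginals; returns the plan."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    fi = reg_m / (reg_m + eps)           # exponent induced by the KL penalty
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]

def merge_topics(old, new, match_ratio=0.5, **uot_kwargs):
    """Hypothetical merge rule built on the UOT plan (illustrative only)."""
    # Cost: squared Euclidean distance between L2-normalised embeddings,
    # which keeps the cost in [0, 4] so exp(-C / eps) does not underflow to 0.
    old_n = old / np.linalg.norm(old, axis=1, keepdims=True)
    new_n = new / np.linalg.norm(new, axis=1, keepdims=True)
    C = ((old_n[:, None, :] - new_n[None, :, :]) ** 2).sum(axis=-1)
    a = np.full(len(old), 1.0 / len(old))
    b = np.full(len(new), 1.0 / len(new))
    P = unbalanced_sinkhorn(a, b, C, **uot_kwargs)

    merged, born = old.astype(float).copy(), []
    for j in range(len(new)):
        if P[:, j].sum() >= match_ratio * b[j]:   # enough mass received: matched
            i = int(P[:, j].argmax())
            merged[i] = 0.5 * (merged[i] + new[j])
        else:                                     # little mass: treat as new topic
            born.append(new[j])
    return merged, born, P

# Toy example: two new topics are near old ones, the third is unrelated.
old = np.eye(4)[:3]
new = np.array([[1.0, 0.05, 0.0, 0.0],
                [0.0, 1.0, 0.05, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
merged, born, P = merge_topics(old, new)
```

The unbalanced formulation matters here because topics may appear or disappear between batches: relaxing the marginal constraints lets an unrelated topic receive almost no transported mass instead of being forced onto an unrelated partner.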
Innovation

Methods, ideas, or system contributions that make the work stand out.

Merges Embedded Topic Models with optimal transport
Uses online change point detection algorithm
Handles data streams via partial document batches
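As a rough illustration of online change point detection on a stream of batches, the sketch below compares the topic proportions of each batch with those of the previous batch using the KL divergence and flags a change when the score exceeds a fixed threshold. The page's summary mentions a KL-divergence-based detector, but the specific rule here (consecutive-batch comparison, fixed `threshold`) and the function names are assumptions, not the authors' algorithm.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, smoothed to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def detect_change_points(stream, threshold=0.1):
    """Return indices of batches whose topic mix differs from the previous one.

    stream: iterable of per-batch topic-proportion vectors (hypothetical input).
    """
    change_points, prev = [], None
    for t, props in enumerate(stream):
        if prev is not None and kl_divergence(props, prev) > threshold:
            change_points.append(t)
        prev = props
    return change_points

# Toy stream: the dominant topic switches at batch index 3.
stream = [[0.7, 0.2, 0.1]] * 3 + [[0.1, 0.2, 0.7]] * 3
changes = detect_change_points(stream)
```

The detector keeps only the previous batch's proportions in memory, so it processes the stream in a single pass. A practical version would likely need an adaptive threshold rather than a fixed one.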
Federica Granese
Université Côte d’Azur, Inria
Benjamin Navet
Université Côte d’Azur, Inria
S. Villata
Université Côte d’Azur, Inria, CNRS
Charles Bouveyron
Professor of Statistics, Chair in AI at Institut 3IA Côte d'Azur, Inria, Université Côte d'Azur
Artificial Intelligence, Computational Statistics, Statistical Learning, Model-based Clustering