S3 - Semantic Signal Separation

📅 2024-06-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing topic modeling approaches suffer from computational inefficiency, result instability, and heavy reliance on preprocessing, while the bag-of-words assumption fails to capture contextual and semantic relationships. To address these limitations, the authors propose Semantic Signal Separation ($S^3$): a theory-driven topic modeling approach that operates in neural embedding spaces. Its core idea is to conceptualize topics as independent axes of semantic space and to uncover them with blind-source separation, bypassing conventional text cleaning, tokenization, and stopword removal entirely. $S^3$ produces the most diverse and highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextually sensitive topic model to date. It is open-sourced as part of the Turftopic Python package.

📝 Abstract
Topic models are useful tools for discovering latent semantic structures in large textual corpora. Topic modeling historically relied on bag-of-words representations of language. This approach makes models sensitive to the presence of stop words and noise, and does not utilize potentially useful contextual information. Recent efforts have been oriented at incorporating contextual neural representations in topic modeling and have been shown to outperform classical topic models. These approaches are, however, typically slow, volatile and still require preprocessing for optimal results. We present Semantic Signal Separation ($S^3$), a theory-driven topic modeling approach in neural embedding spaces. $S^3$ conceptualizes topics as independent axes of semantic space, and uncovers these with blind-source separation. Our approach provides the most diverse, highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextually sensitive topic model to date. We offer an implementation of $S^3$, among other approaches, in the Turftopic Python package.
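The abstract's core idea, treating document embeddings as linear mixtures of independent semantic "signals" and recovering topic axes with blind-source separation, can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for real contextual sentence embeddings, and scikit-learn's FastICA plays the role of the blind-source separation step.

```python
# Minimal sketch of the S^3 idea: document embeddings are modeled as linear
# mixtures of independent latent topic signals; ICA recovers the topic axes.
# Random data stands in for real sentence-encoder output (an assumption made
# here so the sketch is self-contained).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n_docs, embed_dim, n_topics = 200, 64, 5

# Synthetic "document embeddings": 5 non-Gaussian latent topic signals,
# mixed linearly into a 64-dimensional embedding space.
sources = rng.laplace(size=(n_docs, n_topics))    # latent topic activations
mixing = rng.normal(size=(n_topics, embed_dim))   # how each topic spans the space
doc_embeddings = sources @ mixing

# Blind-source separation: decompose embeddings into independent semantic axes.
ica = FastICA(n_components=n_topics, random_state=0)
doc_topic = ica.fit_transform(doc_embeddings)  # per-document topic scores
topic_axes = ica.components_                   # each row: one recovered axis

print(doc_topic.shape)   # (200, 5)
print(topic_axes.shape)  # (5, 64)
```

Because ICA exploits the non-Gaussianity of the latent signals, the recovered axes are statistically independent rather than merely uncorrelated, which is what lets the method disentangle overlapping semantic directions in the embedding space.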
Problem

Research questions and friction points this paper is trying to address.

Improving topic modeling efficiency and speed in large corpora
Eliminating heavy preprocessing for contextual topic models
Enhancing topic coherence and diversity in neural embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Independent Component Analysis for topic modeling
Operates directly on contextualized document embeddings
No preprocessing required for optimal performance
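One way the bullets above can fit together in practice: once ICA has been fit on document embeddings, terms can be ranked against each recovered axis by embedding a vocabulary with the same encoder and projecting it into the topic space. This is a hedged sketch of that labeling step, not the paper's code; the vocabulary and all embeddings are synthetic stand-ins, and `term_` names are hypothetical.

```python
# Hedged sketch: label ICA topic axes by ranking vocabulary terms.
# Random vectors stand in for encoder output for both documents and terms.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
vocab = [f"term_{i}" for i in range(100)]  # hypothetical vocabulary
embed_dim, n_topics = 32, 4

doc_embeddings = rng.normal(size=(300, embed_dim))
term_embeddings = rng.normal(size=(len(vocab), embed_dim))

# Fit the decomposition on documents only.
ica = FastICA(n_components=n_topics, random_state=0, max_iter=500)
ica.fit(doc_embeddings)

# Project term embeddings into the same topic space; high loadings mark
# terms that align with a given semantic axis.
term_topic = ica.transform(term_embeddings)  # shape: (len(vocab), n_topics)
for k in range(n_topics):
    top = np.argsort(term_topic[:, k])[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
```

Note that no tokenization or stopword removal appears anywhere in the pipeline: the encoder consumes raw text, which is what allows the "no preprocessing" claim.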
🔎 Similar Papers
No similar papers found.
Márton Kardos
Junior Developer, Center for Humanities Computing, Aarhus University
NLP · topic modeling · Bayesian machine learning · model interpretability
Jan Kostkan
Aarhus University
Arnault-Quentin Vermillet
Aarhus University
Kristoffer L. Nielbo
Aarhus University
Kenneth C. Enevoldsen
Aarhus University
Roberta Rocca
Aarhus University