🤖 AI Summary
To address the rigidity of the single-topic assumption, poor scalability, and insufficient robustness for low-resource languages in short-text topic modeling, this paper proposes an unsupervised multi-component topic modeling framework grounded in clustering and semantic decomposition. The method enables, for the first time in short-text settings, joint discovery of fine-grained, non-exclusive semantic components—without relying on pre-trained large language models—and supports near-zero-noise, cross-lingual modeling, including for low-resource languages such as Hausa. Experiments on Twitter data in English, Hausa, and Chinese demonstrate that the approach identifies over twice as many semantic components as baseline methods, with near-zero noise rates. Its coherence and diversity scores match those of BERTopic, while achieving significantly higher computational efficiency.
📝 Abstract
Topic modeling is a key method in text analysis, but existing approaches either assume one topic per document or fail to scale efficiently on large, noisy datasets of short texts. We introduce Semantic Component Analysis (SCA), a novel topic modeling technique that overcomes these limitations by discovering multiple, nuanced semantic components beyond a single topic in short texts; we accomplish this by introducing a decomposition step into the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa, and Chinese. It achieves competitive coherence and diversity compared to BERTopic, while uncovering at least double the semantic components and maintaining a noise rate close to zero. Furthermore, SCA is scalable and effective across languages, including an underrepresented one.
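To make the core idea concrete, here is a minimal, dependency-free sketch of a decomposition step that yields non-exclusive components for short texts. This is an illustration under stated assumptions, not the authors' actual SCA pipeline: it uses a toy bag-of-words matrix and a truncated SVD in place of the paper's representations and decomposition, and the thresholding rule for multi-component assignment is invented for the example.

```python
import numpy as np

# Toy corpus of short texts; the paper targets noisy Twitter data.
docs = [
    "rain flood storm weather",
    "match goal football win",
    "storm cancels football match",  # mixes weather and sports
    "flood warning rain heavy",
    "goal scored win team",
]

# 1) Bag-of-words matrix (a stand-in for whatever representation the
#    real pipeline uses; counts keep the sketch self-contained).
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for r, d in enumerate(docs):
    for w in d.split():
        X[r, idx[w]] += 1.0

# 2) Decomposition step: a truncated SVD of the centered matrix gives
#    latent semantic components and per-document activations.
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
k = 2
loadings = U[:, :k] * s[:k]  # document-by-component activation matrix

# 3) Non-exclusive assignment (illustrative rule): a document joins every
#    component whose absolute activation is within half of its strongest
#    one, so mixed texts can receive several component labels.
row_max = np.abs(loadings).max(axis=1, keepdims=True)
assignments = [
    set(np.where(np.abs(loadings[r]) >= 0.5 * row_max[r])[0])
    for r in range(len(docs))
]
```

Because assignment is relative to each document's own strongest activation, every document receives at least one component and none is discarded as noise, which mirrors the near-zero noise rate the method reports; the actual decomposition and assignment criteria in SCA differ from this toy rule.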