Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

📅 2026-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of traditional neural topic models, which rely on the bag-of-words assumption, ignore contextual semantics, and are vulnerable to data sparsity. The authors propose a novel approach that leverages large language models to generate next-word probability distributions under tailored prompts, projects these distributions onto a predefined vocabulary to construct semantically rich soft labels, and uses them as supervision signals to guide the topic model in reconstructing documents from the language model’s hidden states. This is the first method to integrate language model–driven semantic soft labels into topic modeling, substantially improving topic coherence and purity. Extensive experiments demonstrate superior performance across three benchmark datasets and significant gains over existing approaches in semantic document retrieval tasks.
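The pipeline the summary describes (take an LM's next-token distribution under a prompt, project it onto the topic model's fixed vocabulary, renormalize it into a soft label, and use it as the reconstruction target) can be sketched numerically. This is a minimal illustration, not the paper's implementation: the logits here are synthetic stand-ins for a real LM's output, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper, next-token logits come from an LM
# conditioned on a tailored prompt; here they are random for illustration.
lm_vocab = ["the", "economy", "market", "inflation", "game", "team", "score", "<pad>"]
topic_vocab = ["economy", "market", "inflation", "game", "team", "score"]  # pre-defined BoW vocabulary

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Step 1: next-token probability distribution over the LM vocabulary.
lm_logits = rng.normal(size=len(lm_vocab))
lm_probs = softmax(lm_logits)

# Step 2: project onto the topic model's vocabulary — keep only tokens that
# appear in topic_vocab, then renormalize. This is the semantic soft label.
idx = [lm_vocab.index(w) for w in topic_vocab]
soft_label = lm_probs[idx] / lm_probs[idx].sum()

# Step 3: supervision signal — cross-entropy between the soft label and the
# topic model's reconstruction (a random decoder output in this sketch).
recon_probs = softmax(rng.normal(size=len(topic_vocab)))
loss = -(soft_label * np.log(recon_probs + 1e-12)).sum()
```

Compared with a hard one-hot BoW target, the soft label spreads probability mass across semantically related tokens, which is what gives the topic model a contextual signal under data sparsity.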

📝 Abstract
Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next-token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels from the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence and purity over existing baselines. We also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
Problem

Research questions and friction points this paper is trying to address.

neural topic modeling
data sparsity
contextual information
topic coherence
semantic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Topic Modeling
Semantically-Grounded Soft Labels
Language Models
Contextual Supervision
Topic Coherence