Disentangling Similarity and Relatedness in Topic Models

πŸ“… 2026-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing topic models struggle to distinguish semantic similarity from thematic relatedness among topic words. This work introduces the psycholinguistic dimensions of similarity and relatedness into topic model evaluation for the first time: it constructs a synthetic word-pair benchmark annotated by large language models and trains a neural scoring function to quantify the two dimensions. The resulting framework is applied systematically across multiple corpora and topic model families to assess how each captures semantic structure. Results reveal marked differences in semantic preferences across model families and show that similarity and relatedness scores predict downstream task performance, yielding an interpretable metric for topic coherence beyond traditional measures.

πŸ“ Abstract
The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.
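The abstract's pipeline — score each word pair on two axes (taxonomic similarity and thematic relatedness) with a small neural scorer trained on annotated pairs — can be sketched as follows. This is a minimal illustration, not the paper's architecture: the embedding dimension, hidden size, two-layer MLP, and the toy "cup/coffee" target are all assumptions, and the embeddings here are random stand-ins for real PLM vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 50-d word embeddings, 32 hidden units,
# two outputs (similarity, relatedness) in [0, 1].
D, H = 50, 32
W1 = rng.normal(0, 0.1, (2 * D, H))
b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, 2))
b2 = np.zeros(2)

def score_pair(e1, e2):
    """Score a word pair: returns (similarity, relatedness) in [0, 1]."""
    x = np.concatenate([e1, e2])
    h = np.tanh(x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid outputs

def train_step(e1, e2, target, lr=0.1):
    """One SGD step on squared error for a single annotated pair."""
    global W1, b1, W2, b2
    x = np.concatenate([e1, e2])
    h = np.tanh(x @ W1 + b1)
    y = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    err = y - target                  # gradient of squared error w.r.t. y
    dz = err * y * (1 - y)            # back through the sigmoid
    dW2, db2 = np.outer(h, dz), dz
    dh = W2 @ dz
    dpre = dh * (1 - h ** 2)          # back through the tanh
    dW1, db1 = np.outer(x, dpre), dpre
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    return float(((y - target) ** 2).mean())

# Toy usage: fit one annotated pair that is thematically related
# but not taxonomically similar (e.g. "cup" / "coffee").
e_cup, e_coffee = rng.normal(size=D), rng.normal(size=D)
target = np.array([0.1, 0.9])         # (similarity, relatedness)
losses = [train_step(e_cup, e_coffee, target) for _ in range(200)]
```

In the paper's setting, the training pairs would come from the LLM-annotated benchmark and the inputs from pre-trained embeddings; the trained scorer is then averaged over a topic's top words to characterise what kind of semantic structure the topic model captures.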
Problem

Research questions and friction points this paper is trying to address.

topic models
semantic similarity
thematic relatedness
disentanglement
evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

topic models
semantic similarity
thematic relatedness
pre-trained language models
neural scoring function