Disentangling Similarity and Relatedness in Topic Models

πŸ“… 2026-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing topic models struggle to distinguish semantic similarity from thematic relatedness among topic words. This work introduces the psycholinguistic dimensions of similarity and relatedness into topic model evaluation for the first time: it constructs a synthetic word-pair benchmark annotated by large language models and trains a neural scoring function to quantify the two dimensions. The resulting framework is applied systematically across multiple corpora and topic model families to assess how each captures semantic structure. Results reveal marked differences in semantic preferences across model families and show that similarity and relatedness scores predict downstream task performance, yielding an interpretable metric for topic coherence beyond traditional measures.

πŸ“ Abstract
The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.
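The abstract's pipeline — score each word pair on two axes (taxonomic similarity and thematic relatedness) with a small neural scorer trained on annotated pairs — can be sketched as follows. This is a minimal illustration, not the paper's architecture: the embedding dimension, hidden size, two-layer MLP, and the toy "cup/coffee" target are all assumptions, and the embeddings here are random stand-ins for real PLM vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 50-d word embeddings, 32 hidden units,
# two outputs (similarity, relatedness) in [0, 1].
D, H = 50, 32
W1 = rng.normal(0, 0.1, (2 * D, H))
b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, 2))
b2 = np.zeros(2)

def score_pair(e1, e2):
    """Score a word pair: returns (similarity, relatedness) in [0, 1]."""
    x = np.concatenate([e1, e2])
    h = np.tanh(x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid outputs

def train_step(e1, e2, target, lr=0.1):
    """One SGD step on squared error for a single annotated pair."""
    global W1, b1, W2, b2
    x = np.concatenate([e1, e2])
    h = np.tanh(x @ W1 + b1)
    y = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    err = y - target                  # gradient of squared error w.r.t. y
    dz = err * y * (1 - y)            # back through the sigmoid
    dW2, db2 = np.outer(h, dz), dz
    dh = W2 @ dz
    dpre = dh * (1 - h ** 2)          # back through the tanh
    dW1, db1 = np.outer(x, dpre), dpre
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    return float(((y - target) ** 2).mean())

# Toy usage: fit one annotated pair that is thematically related
# but not taxonomically similar (e.g. "cup" / "coffee").
e_cup, e_coffee = rng.normal(size=D), rng.normal(size=D)
target = np.array([0.1, 0.9])         # (similarity, relatedness)
losses = [train_step(e_cup, e_coffee, target) for _ in range(200)]
```

In the paper's setting, the training pairs would come from the LLM-annotated benchmark and the inputs from pre-trained embeddings; the trained scorer is then averaged over a topic's top words to characterise what kind of semantic structure the topic model captures.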
Problem

Research questions and friction points this paper is trying to address.

topic models
semantic similarity
thematic relatedness
disentanglement
evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

topic models
semantic similarity
thematic relatedness
pre-trained language models
neural scoring function