When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation

📅 2026-03-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge that existing automatic metrics struggle to evaluate topic models in specialized domains, where they frequently misalign with human judgment. To bridge this gap, we propose a novel task called Topic Word Mixing (TWM), which assesses inter-topic distinctness by having human annotators judge whether a given set of words originates from a single topic or a mixture of topics. We complement this with a word intrusion task to evaluate intra-topic coherence. Leveraging nearly 4,000 human-annotated samples from a domain-specific corpus of philosophy of science publications, we compare six prominent topic models (LDA, NMF, Top2Vec, BERTopic, CFMF, and CFMF-emb) across both automatic and human evaluations. Our experiments show that TWM reliably captures human-perceived topic distinctness and aligns with diversity metrics, whereas conventional automatic coherence measures do not always track human judgment in specialized contexts.

📝 Abstract
Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task assessing inter-topic distinctness by testing whether annotators can distinguish between word sets from single or mixed topics. TWM complements word intrusion's focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models, both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb), comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.
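The TWM setup described above can be sketched in a few lines: a "pure" instance draws all words from one topic's top-word list, while a "mixed" instance splits the draw across two distinct topics, and the annotator must tell the cases apart. This is a minimal illustrative sketch, not the authors' released task-generation code; the function name `make_twm_sample` and its parameters are assumptions for illustration.

```python
import random

def make_twm_sample(topics, n_words=6, mixed=False, rng=None):
    """Build one Topic Word Mixing (TWM) task instance.

    `topics` maps a topic id to its ranked top-word list. A pure sample
    draws all `n_words` from one topic; a mixed sample splits the draw
    across two distinct topics. Words are shuffled so order gives no hint.
    """
    rng = rng or random.Random()
    ids = list(topics)
    if mixed:
        t1, t2 = rng.sample(ids, 2)          # two distinct source topics
        half = n_words // 2
        words = (rng.sample(topics[t1], half)
                 + rng.sample(topics[t2], n_words - half))
        label = "mixed"
    else:
        t1 = rng.choice(ids)                 # a single source topic
        words = rng.sample(topics[t1], n_words)
        label = "single"
    rng.shuffle(words)
    return {"words": words, "label": label}
```

A model whose topics are highly distinct should make mixed instances easy to spot; annotator accuracy on such instances is what lets TWM act as a human-grounded counterpart to automatic diversity metrics.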
Problem

Research questions and friction points this paper is trying to address.

topic model evaluation
human-metric alignment
domain-specific corpora
topic distinctness
evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Topic Word Mixing
human evaluation
topic model evaluation
inter-topic distinctness
domain-specific corpus
Thibault Prouteau
Université de Lorraine, CNRS, LORIA, Nancy (France)
Francis Lareau
Université de Sherbrooke, Dept of Philosophy and Applied Ethics, Sherbrooke (Canada); Université du Québec à Montréal, Dept of Philosophy & CIRST, Montréal (Canada)
Nicolas Dugué
Associate professor, University of Le Mans
Interpretability · Complex networks · Computational linguistics
Jean-Charles Lamirel
Université de Lorraine, CNRS, LORIA, Nancy (France); Université de Strasbourg, Strasbourg (France)
Christophe Malaterre
Université du Québec à Montréal, Dept of Philosophy & CIRST, Montréal (Canada)