When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation

📅 2026-03-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge that existing automatic metrics struggle to evaluate topic models in specialized domains, where they frequently misalign with human judgment. To bridge this gap, we propose a novel task called Topic Word Mixing (TWM), which assesses inter-topic distinctness by having human annotators judge whether a given set of words originates from a single topic or a mixture of topics. We complement this with a word intrusion task to evaluate intra-topic coherence. Leveraging nearly 4,000 human-annotated samples from a domain-specific corpus of philosophy of science publications, we compare six prominent topic models (LDA, NMF, Top2Vec, BERTopic, CFMF, and CFMF-emb) across both automatic and human evaluations. Our experiments show that TWM reliably captures human-perceived topic distinctness and aligns with diversity metrics, whereas conventional automatic coherence measures do not always track human judgment in specialized contexts.

📝 Abstract
Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task assessing inter-topic distinctness by testing whether annotators can distinguish between word sets from single or mixed topics. TWM complements word intrusion's focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models, both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb), comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.
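The TWM setup described above can be sketched in a few lines: a "pure" instance draws all words from one topic's top-word list, while a "mixed" instance splits the draw across two distinct topics, and the annotator must tell the cases apart. This is a minimal illustrative sketch, not the authors' released task-generation code; the function name `make_twm_sample` and its parameters are assumptions for illustration.

```python
import random

def make_twm_sample(topics, n_words=6, mixed=False, rng=None):
    """Build one Topic Word Mixing (TWM) task instance.

    `topics` maps a topic id to its ranked top-word list. A pure sample
    draws all `n_words` from one topic; a mixed sample splits the draw
    across two distinct topics. Words are shuffled so order gives no hint.
    """
    rng = rng or random.Random()
    ids = list(topics)
    if mixed:
        t1, t2 = rng.sample(ids, 2)          # two distinct source topics
        half = n_words // 2
        words = (rng.sample(topics[t1], half)
                 + rng.sample(topics[t2], n_words - half))
        label = "mixed"
    else:
        t1 = rng.choice(ids)                 # a single source topic
        words = rng.sample(topics[t1], n_words)
        label = "single"
    rng.shuffle(words)
    return {"words": words, "label": label}
```

A model whose topics are highly distinct should make mixed instances easy to spot; annotator accuracy on such instances is what lets TWM act as a human-grounded counterpart to automatic diversity metrics.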
Problem

Research questions and friction points this paper is trying to address.

topic model evaluation
human-metric alignment
domain-specific corpora
topic distinctness
evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Topic Word Mixing
human evaluation
topic model evaluation
inter-topic distinctness
domain-specific corpus
Thibault Prouteau
Université de Lorraine, CNRS, LORIA, Nancy (France)
Francis Lareau
Université de Sherbrooke, Dept of Philosophy and Applied Ethics, Sherbrooke (Canada); Université du Québec à Montréal, Dept of Philosophy & CIRST, Montréal (Canada)
Nicolas Dugué
Associate professor, University of Le Mans
Interpretability · Complex networks · Computational linguistics
Jean-Charles Lamirel
Université de Lorraine, CNRS, LORIA, Nancy (France); Université de Strasbourg, Strasbourg (France)
Christophe Malaterre
Université du Québec à Montréal, Dept of Philosophy & CIRST, Montréal (Canada)