Constrained Non-negative Matrix Factorization for Guided Topic Modeling of Minority Topics

📅 2025-05-22

🤖 AI Summary

Topic modeling often fails to identify low-frequency yet critical “minority topics” (e.g., mental health issues in online reviews). Existing domain-knowledge-guided approaches rely on overly restrictive prior assumptions, hindering automatic discovery of topic structure. This paper proposes a doubly constrained non-negative matrix factorization (NMF) method that jointly incorporates seed-word guidance and topic prevalence constraints. This enables data-driven discovery of minority topics alongside majority topics, without pre-specifying how the seed words are partitioned across minority topics. The model is fitted with multiplicative updates derived from the Karush–Kuhn–Tucker (KKT) conditions. On synthetic data, the method improves on baselines in topic purity (+23.6%) and normalized mutual information (+18.4%), and topic quality is additionally assessed with Jensen–Shannon divergence. A case study on YouTube vlog comments identifies and interprets mental health–related minority topics, illustrating the method's applicability to a real domain.

📝 Abstract

Topic models often fail to capture low-prevalence, domain-critical themes, so-called minority topics, such as mental health themes in online comments. Some existing methods can incorporate domain knowledge, such as expected topical content, but those that allow guidance may require overly detailed specifications of the expected topics, hindering the discovery of topic divisions and variation. We propose a topic modeling solution via a specially constrained NMF. We incorporate a seed word list characterizing minority content of interest, but we do not require experts to pre-specify its division across minority topics. Through prevalence constraints on minority topics and constraints on seed word content across topics, we learn distinct data-driven minority topics as well as majority topics. The constrained NMF is fitted via Karush-Kuhn-Tucker (KKT) conditions with multiplicative updates. We outperform several baselines on synthetic data in terms of topic purity and normalized mutual information, and we also evaluate topic quality using Jensen-Shannon divergence (JSD). We conduct a case study on YouTube vlog comments, analyzing viewer discussion of mental health content; our model successfully identifies and reveals this domain-relevant minority content.
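To make the idea of fitting a constrained NMF with multiplicative updates more concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not spell out the exact constraints, so the sketch assumes a simple stand-in, a linear penalty `lam` that discourages majority topics from placing weight on seed words. The function name `seed_guided_nmf`, its parameters, and the convention that the first `n_minority` rows of `H` are the minority topics are all assumptions made for illustration. The prevalence constraint on minority topics is not modeled here.

```python
import numpy as np

def seed_guided_nmf(X, n_topics, n_minority, seed_idx,
                    lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Illustrative NMF X ~ W @ H with a seed-word penalty (assumed form).

    X          : (n_docs, n_words) non-negative document-term matrix
    n_minority : the first n_minority rows of H are treated as minority topics
    seed_idx   : column indices of the seed words
    lam        : penalty weight on seed-word mass in majority topics
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    W = rng.random((n_docs, n_topics)) + eps   # document-topic weights
    H = rng.random((n_topics, n_words)) + eps  # topic-word weights

    # Mask is 1 where a majority topic meets a seed word, 0 elsewhere.
    M = np.zeros((n_topics, n_words))
    M[n_minority:, seed_idx] = 1.0

    for _ in range(n_iter):
        # Standard multiplicative update for W (Frobenius loss).
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
        # The penalty lam * sum(M * H) adds lam * M to the gradient, so it
        # appears in the denominator of the multiplicative update for H.
        H *= (W.T @ X) / ((W.T @ W) @ H + lam * M + eps)
    return W, H
```

Calling `seed_guided_nmf(X, n_topics=10, n_minority=3, seed_idx=[...])` on a document-term matrix returns non-negative factors in which seed-word weight is pushed toward the first three topics. Because the updates only multiply by non-negative ratios, `W` and `H` stay non-negative without an explicit projection step.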

Problem

Research questions and friction points this paper is trying to address.

Capturing low-prevalence domain-critical minority topics

Reducing reliance on overly detailed expert guidance

Identifying distinct minority and majority topics automatically

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained NMF for guided topic modeling

Seed word list for minority topic identification

Prevalence constraints enhance topic purity
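For readers unfamiliar with the evaluation measures named on this page, the following sketch shows one common way to compute topic purity and Jensen-Shannon divergence. How the paper applies them (for example, which distributions are compared with JSD) is not stated here, so the helper names `topic_purity` and `jensen_shannon_divergence` and their inputs are assumptions for illustration only.

```python
import numpy as np

def topic_purity(pred, true):
    """Fraction of documents whose true label matches the majority
    true label of the topic they were assigned to."""
    pred = np.asarray(pred)
    true = np.asarray(true)
    matched = 0
    for topic in np.unique(pred):
        # Count true labels among documents assigned to this topic.
        _, counts = np.unique(true[pred == topic], return_counts=True)
        matched += counts.max()
    return matched / len(pred)

def jensen_shannon_divergence(p, q):
    """JSD between two non-negative weight vectors, using log base 2
    so the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()          # normalize to probability distributions
    q = q / q.sum()
    m = 0.5 * (p + q)        # mixture distribution

    def kl(a, b):
        mask = a > 0         # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms, two topics with identical word distributions have a JSD of 0 and two topics with no shared words have a JSD of 1, which makes the score easy to read as a measure of topic distinctness.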
