Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization

📅 2025-03-23
🤖 AI Summary
This study investigates the thematic evolution of COVID-19 scientific literature. To address the challenge of dynamically identifying robust, interpretable topics over time, we propose an integrated analytical framework that combines context-aware text preprocessing, TF-IDF weighting, and non-negative matrix factorization (NMF) topic modeling. Crucially, we introduce, for the first time in this domain, a systematic NMF stability analysis that determines the optimal number of topics objectively rather than through subjective empirical heuristics. By tracking the temporal distributions and semantic shifts of topics, we identify stable, semantically coherent thematic clusters and quantify how they evolve from stage to stage. The method unifies static topic structure discovery with dynamic longitudinal analysis, enabling rigorous, reproducible evaluation of topic models. It provides a methodological foundation for monitoring the pandemic-related research landscape, constructing knowledge graphs, and supporting evidence-based science policy.

📝 Abstract
In this work, we apply topic modeling using Non-Negative Matrix Factorization (NMF) on the COVID-19 Open Research Dataset (CORD-19) to uncover the underlying thematic structure and its evolution within the extensive body of COVID-19 research literature. NMF factorizes the document-term matrix into two non-negative matrices, effectively representing the topics and their distribution across the documents. This helps us see how strongly documents relate to topics and how topics relate to words. We describe the complete methodology which involves a series of rigorous pre-processing steps to standardize the available text data while preserving the context of phrases, and subsequently feature extraction using the term frequency-inverse document frequency (tf-idf), which assigns weights to words based on their frequency and rarity in the dataset. To ensure the robustness of our topic model, we conduct a stability analysis. This process assesses the stability scores of the NMF topic model for different numbers of topics, enabling us to select the optimal number of topics for our analysis. Through our analysis, we track the evolution of topics over time within the CORD-19 dataset. Our findings contribute to the understanding of the knowledge structure of the COVID-19 research landscape, providing a valuable resource for future research in this field.
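As a rough illustration of the tf-idf and NMF steps described in the abstract, the following minimal sketch factorizes a tiny document-term matrix X into a document-topic matrix W and a topic-term matrix H. Everything here is an assumption made for illustration: the toy corpus, the function names `tfidf` and `nmf`, the smoothed idf variant, and the multiplicative-update solver are not taken from the paper, which does not state its implementation in this summary.

```python
import numpy as np

def tfidf(docs):
    """Build a row-normalised tf-idf matrix from whitespace-tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for tok in doc.split():
            counts[i, index[tok]] += 1
    df = (counts > 0).sum(axis=0)                  # document frequency per term
    idf = np.log((1 + len(docs)) / (1 + df)) + 1   # one common smoothed idf variant
    X = counts * idf
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1, norms), vocab

def nmf(X, k, n_iter=300, seed=0, eps=1e-10):
    """Approximate X by W @ H with W, H >= 0 using multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps   # document-topic weights
    H = rng.random((k, X.shape[1])) + eps   # topic-term weights
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy corpus, not drawn from CORD-19
docs = [
    "vaccine trial efficacy dose",
    "vaccine dose immune response",
    "lockdown mobility transmission policy",
    "transmission mobility mask policy",
]
X, vocab = tfidf(docs)
W, H = nmf(X, k=2)
for t, row in enumerate(H):
    top = [vocab[j] for j in np.argsort(row)[::-1][:3]]
    print(f"topic {t}: {top}")
```

Each row of W indicates how strongly a document relates to each topic, and each row of H indicates how strongly a topic relates to each term. In practice a library implementation would likely replace both helper functions.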
Problem

Research questions and friction points this paper is trying to address.

Uncover thematic structure in COVID-19 research literature
Track evolution of topics over time in CORD-19 dataset
Ensure robustness of topic model via stability analysis
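One simple way to track topics over time, sketched here under stated assumptions, is to normalise each document's topic weights and average them per publication period. The function name `topic_share_by_period`, the quarterly labels, and the weight values are hypothetical; the paper's actual aggregation scheme is not given in this summary.

```python
import numpy as np

def topic_share_by_period(W, periods):
    """Average each topic's share of document weight within every period."""
    W = np.asarray(W, dtype=float)
    totals = W.sum(axis=1, keepdims=True)
    # Normalise rows so each document's topic weights sum to 1
    # (all-zero rows are left as zeros).
    shares = W / np.where(totals == 0, 1, totals)
    result = {}
    for period in sorted(set(periods)):
        mask = np.array([p == period for p in periods])
        result[period] = shares[mask].mean(axis=0)
    return result

# Hypothetical document-topic weights (rows: documents, columns: topics)
W = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.3, 0.7],
              [0.1, 0.9]])
periods = ["2020-Q1", "2020-Q1", "2020-Q2", "2020-Q2"]

trend = topic_share_by_period(W, periods)
for period, share in trend.items():
    print(period, np.round(share, 2))
```

A rising share for a topic across consecutive periods would suggest growing attention to that theme, while a falling share would suggest the opposite.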
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Non-Negative Matrix Factorization for topic modeling
Applies tf-idf for feature extraction in text data
Conducts stability analysis to optimize topic count
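The stability analysis could be approximated along the following lines. This summary does not specify the stability measure used in the paper, so the sketch assumes one plausible choice: run NMF from several random initialisations, compare the top-term sets of the resulting topics with a best-matching mean Jaccard similarity, and prefer the topic count whose runs agree most. The helper names, the toy matrix, and the brute-force matching (practical only for small k) are all illustrative assumptions.

```python
import itertools
import numpy as np

def nmf(X, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate X by W @ H with W, H >= 0 using multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def top_terms(H, n_top):
    """Return, for each topic, the set of indices of its highest-weighted terms."""
    return [set(np.argsort(row)[::-1][:n_top].tolist()) for row in H]

def agreement(a, b):
    """Mean Jaccard similarity under the best one-to-one topic matching."""
    k = len(a)
    best = 0.0
    for perm in itertools.permutations(range(k)):
        score = sum(len(a[i] & b[j]) / len(a[i] | b[j])
                    for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best

def stability(X, k, n_runs=5, n_top=3):
    """Average pairwise agreement of top-term sets across random restarts."""
    runs = [top_terms(nmf(X, k, seed=s)[1], n_top) for s in range(n_runs)]
    pairs = list(itertools.combinations(runs, 2))
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy matrix: two blocks of co-occurring terms
X = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)

scores = {k: stability(X, k) for k in (2, 3)}
best_k = max(scores, key=scores.get)
print(scores, "selected k:", best_k)
```

A score near 1 means the restarts recover nearly the same topics; scores that drop as k grows suggest the extra topics are not reproducible.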