Graph Topic Modeling for Documents with Spatial or Covariate Dependencies

📅 2024-12-19
🤖 AI Summary
This paper addresses topic modeling for document collections that exhibit spatial or covariate-dependent structure. To improve the accuracy of document-level topic proportion estimation, we propose a graph-regularized topic modeling framework. Our method represents documents as nodes in a graph whose edges encode pairwise covariate similarity; graph regularization is then integrated into an iterative singular value decomposition (SVD) to encourage topic distributions to vary smoothly across neighboring nodes. This work is the first to incorporate graph regularization into the pLSI framework. We derive a high-probability upper bound on the topic estimation error and design a graph-aware cross-validation strategy for adaptive selection of the regularization strength. Experiments on synthetic data and three real-world corpora show improved topic estimation accuracy and faster inference compared to existing Bayesian methods.

📝 Abstract
We address the challenge of incorporating document-level metadata into topic modeling to improve topic mixture estimation. To overcome the computational complexity and lack of theoretical guarantees in existing Bayesian methods, we extend probabilistic latent semantic indexing (pLSI), a frequentist framework for topic modeling, by incorporating document-level covariates or known similarities between documents through a graph formalism. Modeling documents as nodes and their similarities as edges, we propose a new estimator based on a fast graph-regularized iterative singular value decomposition (SVD) that encourages similar documents to share similar topic mixture proportions. We characterize the estimation error of the proposed method by deriving high-probability bounds, and we develop a specialized cross-validation method to select the regularization parameters. We validate our model through comprehensive experiments on synthetic datasets and three real-world corpora, demonstrating improved performance and faster inference compared to existing Bayesian methods.
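
To make the idea of a graph-regularized iterative SVD more concrete, here is a minimal sketch. It is not the authors' estimator: the abstract does not specify the penalty or the update rules, so this example assumes a simple quadratic graph-Laplacian penalty and alternates between smoothing the left singular vectors over the graph and re-estimating the right singular vectors. The function name `graph_regularized_svd`, the parameters `lam` and `n_iter`, and the toy chain graph are all hypothetical choices made for illustration.

```python
import numpy as np

def graph_regularized_svd(X, L, K, lam=1.0, n_iter=20):
    """Illustrative rank-K SVD with Laplacian smoothing of the left factor.

    X   : (n_docs, n_words) word-frequency matrix
    L   : (n_docs, n_docs) graph Laplacian of the document graph
    lam : regularization strength (lam = 0 gives plain orthogonal iteration)
    NOTE: the quadratic Laplacian penalty is an assumption of this sketch,
    not necessarily the penalty used in the paper.
    """
    n, _ = X.shape
    # Initialize the right factor from an ordinary truncated SVD.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:K].T
    S = np.eye(n) + lam * L  # system matrix of the smoothing step
    for _ in range(n_iter):
        # Smoothing step: minimize ||U - X V||_F^2 + lam * tr(U^T L U),
        # whose closed-form solution is (I + lam L)^{-1} X V.
        U_smooth = np.linalg.solve(S, X @ V)
        U, _ = np.linalg.qr(U_smooth)  # re-orthonormalize the columns
        # Update the right factor given the smoothed left factor.
        V, _ = np.linalg.qr(X.T @ U)
    return U, V

# Toy example: 8 documents on a chain graph, 12 vocabulary words.
rng = np.random.default_rng(0)
X = rng.random((8, 12))
X = X / X.sum(axis=1, keepdims=True)  # rows are word frequencies
A = np.zeros((8, 8))
idx = np.arange(7)
A[idx, idx + 1] = 1.0
A = A + A.T                           # undirected chain adjacency
L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian
U, V = graph_regularized_svd(X, L, K=2, lam=0.5)
```

With `lam=0` the smoothing step is the identity and the loop reduces to ordinary orthogonal iteration, so the result spans the usual top-K singular subspace. Larger values of `lam` pull the rows of `U` for neighboring documents closer together, which is the behavior the abstract describes as encouraging similar documents to share similar topic mixture proportions.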

Problem

Research questions and friction points this paper is trying to address.

Incorporating document metadata into topic modeling
Overcoming computational complexity in Bayesian methods
Improving topic mixture estimation with graph regularization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-regularized SVD for topic modeling
Incorporating document covariates via graph formalism
Fast iterative estimation with theoretical error bounds
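
As an illustration of the graph formalism mentioned above, the sketch below builds a document graph from covariates. The source does not say how edges are constructed, so this example assumes a symmetrized k-nearest-neighbor rule on Euclidean distance; the function name `knn_graph_laplacian`, the parameter `k`, and the toy coordinates are hypothetical.

```python
import numpy as np

def knn_graph_laplacian(covariates, k=2):
    """Build a symmetrized k-NN adjacency matrix and its Laplacian.

    covariates : (n_docs, d) array of document-level covariates
                 (for example, spatial coordinates).
    NOTE: the k-NN rule is an assumption made for this sketch.
    """
    Z = np.asarray(covariates, dtype=float)
    n = Z.shape[0]
    # Pairwise squared Euclidean distances between documents.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)            # exclude self-neighbors
    nbrs = np.argsort(d2, axis=1)[:, :k]    # k closest documents per row
    A = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    A[rows, nbrs.ravel()] = 1.0
    A = np.maximum(A, A.T)                  # keep an edge if either side chose it
    L = np.diag(A.sum(axis=1)) - A          # combinatorial Laplacian D - A
    return A, L

# Toy example: two well-separated spatial clusters of three documents each.
coords = np.array([[0, 0], [0, 1], [1, 0],
                   [5, 5], [5, 6], [6, 5]])
A, L = knn_graph_laplacian(coords, k=2)
```

In this toy example every document's two nearest neighbors lie in its own cluster, so the graph has no edges between the clusters. A Laplacian built this way can then be passed to any graph-regularized estimator that expects one.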