PRISM: PRIor from corpus Statistics for topic Modeling

📅 2026-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of generating coherent and interpretable topics in emerging or resource-constrained domains, where knowledge-augmented topic models falter because suitable external resources are unavailable. The authors propose a method that uses only intrinsic word co-occurrence statistics from the corpus to construct the Dirichlet prior parameters of Latent Dirichlet Allocation (LDA), without altering its generative process or relying on external resources. The authors present this as the first effective translation of endogenous corpus statistics into a meaningful prior distribution, significantly enhancing topic coherence and interpretability. Empirical results show the method performs on par with state-of-the-art models that depend on external knowledge, and that it particularly excels in data-scarce scenarios, as validated on both textual corpora and single-cell RNA-seq datasets.
📝 Abstract
Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
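The abstract does not give PRISM's exact formula, but the general idea of turning corpus co-occurrence statistics into an asymmetric Dirichlet topic-word prior can be sketched as follows. This is a hypothetical illustration, not the paper's method: the use of document-level NPMI, the per-word aggregation, and the `base`/`scale` parameters are all assumptions made for the example.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def cooccurrence_prior(docs, base=0.01, scale=1.0):
    """Illustrative sketch (NOT the paper's exact formula): derive an
    asymmetric per-word Dirichlet concentration vector from document-level
    word co-occurrence. Words that co-occur consistently with other words
    receive a larger concentration, nudging the topic-word prior toward
    coherent topics."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    n_docs = len(docs)
    df = Counter()   # document frequency per word
    co = Counter()   # document co-occurrence count per word pair
    for d in docs:
        ws = set(d)
        df.update(ws)
        co.update(combinations(sorted(ws), 2))
    # Accumulate positive NPMI per word (a hypothetical aggregation choice).
    score = np.zeros(len(vocab))
    for (a, b), n_ab in co.items():
        p_ab = n_ab / n_docs
        p_a, p_b = df[a] / n_docs, df[b] / n_docs
        if p_ab < 1.0:  # NPMI is undefined for pairs present in every doc
            npmi = np.log(p_ab / (p_a * p_b)) / -np.log(p_ab)
            score[idx[a]] += max(npmi, 0.0)
            score[idx[b]] += max(npmi, 0.0)
    if score.max() > 0:
        score /= score.max()        # rescale scores to [0, 1]
    eta = base + scale * score       # per-word Dirichlet concentrations
    return vocab, eta
```

A vector like `eta` could then be passed as the topic-word prior of an LDA implementation that accepts per-word priors (e.g., the `eta` argument of gensim's `LdaModel`), leaving the generative process itself unchanged.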
Problem

Research questions and friction points this paper is trying to address.

topic modeling
external knowledge
emerging domains
underexplored domains
LDA
Innovation

Methods, ideas, or system contributions that make the work stand out.

topic modeling
corpus-intrinsic initialization
Dirichlet prior
word co-occurrence statistics
LDA