Poisson-Process Topic Model for Integrating Knowledge from Pre-trained Language Models

📅 2025-03-22

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

Traditional topic models rely on word-frequency statistics, neglecting contextual semantics. To address this, we propose the first topic modeling framework formulated as a Poisson point process on an embedding space: documents are mapped to sequences of contextualized word embeddings generated by a frozen pre-trained LLM, and K topic intensity measures—each β-Hölder smooth—are learned via convex combinations. Our method requires no LLM fine-tuning and is plug-and-play compatible with existing topic models. We establish minimax-optimal convergence rates theoretically, and show that when β ≤ 1, the rate matches the information-theoretic lower bound. Computation is enabled via net-rounding discretization and kernel density estimation. Extensive experiments on multiple benchmark datasets demonstrate significant improvements over state-of-the-art topic models, validating both the effectiveness and generalizability of context-aware topic representations.

Technology Category

Application Category

📝 Abstract

Topic modeling is traditionally applied to word counts without accounting for the context in which words appear. Recent advancements in large language models (LLMs) offer contextualized word embeddings, which capture deeper meaning and relationships between words. We aim to leverage such embeddings to improve topic modeling. We use a pre-trained LLM to convert each document into a sequence of word embeddings. This sequence is then modeled as a Poisson point process, with its intensity measure expressed as a convex combination of $K$ base measures, each corresponding to a topic. To estimate these topics, we propose a flexible algorithm that integrates traditional topic modeling methods, enhanced by net-rounding applied before and kernel smoothing applied after. One advantage of this framework is that it treats the LLM as a black box, requiring no fine-tuning of its parameters. Another advantage is its ability to seamlessly integrate any traditional topic modeling approach as a plug-in module, without the need for modifications Assuming each topic is a $eta$-H""{o}lder smooth intensity measure on the embedded space, we establish the rate of convergence of our method. We also provide a minimax lower bound and show that the rate of our method matches with the lower bound when $etaleq 1$. Additionally, we apply our method to several datasets, providing evidence that it offers an advantage over traditional topic modeling approaches.

Problem

Research questions and friction points this paper is trying to address.

Integrate contextual word embeddings from LLMs into topic modeling

Develop a Poisson-process model for topic intensity measures

Prove convergence rates and compare with traditional topic modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained LLM for contextual embeddings

Uses Poisson point process for topic intensity

Integrates traditional topic modeling with enhancements

🔎 Similar Papers

A Large Language Model Guided Topic Refinement Mechanism for Short Text Modeling