fPLSA: Learning Semantic Structures in Document Collections Using Foundation Models

๐Ÿ“… 2024-10-07
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses semantic structure modeling for document collections. We propose an iterative Probabilistic Latent Semantic Analysis (PLSA) method integrated with foundation models. Our approach leverages large language model (LLM) embeddings to perform context-aware, iterative clustering and labeling of text segments within document-level contextsโ€”marking the first integration of foundation model embeddings into the PLSA framework. The resulting hierarchical semantic labels are both interpretable and actionable, serving dual roles: (1) encoding structural semantics and (2) guiding hierarchical prompt-based sampling. Experiments across story generation, mathematical reasoning, and multi-step reasoning tasks demonstrate that our method significantly outperforms existing labeling approaches in text reconstruction quality, answer accuracy (hit rate), and output diversity.

Technology Category

Application Category

๐Ÿ“ Abstract
Humans have the ability to learn new tasks by inferring high-level concepts from existing solution, then manipulating these concepts in lieu of the raw data. Can we automate this process by deriving latent semantic structures in a document collection using foundation models? We introduce fPLSA, a foundation-model-based Probabilistic Latent Semantic Analysis (PLSA) method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fPLSA tags help reconstruct the original texts better than existing tagging methods. Moreover, when used for hierarchical sampling, fPLSA produces more diverse outputs with a higher likelihood of hitting the correct answer than direct sampling and hierarchical sampling with existing tagging methods.
Problem

Research questions and friction points this paper is trying to address.

Inducing high-level semantic structures from documents
Modeling latent document structures using clustering
Improving hierarchical text sampling for correct solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clusters and tags document segments iteratively
Models latent structure using document-level contexts
Enables hierarchical sampling for new text generation
๐Ÿ”Ž Similar Papers
No similar papers found.
W
Weijia Xu
Microsoft Research, Redmond, WA 98052, USA
Nebojsa Jojic
Nebojsa Jojic
Microsoft Research
Machine LearningComputer VisionComputational BiologyImmunologyNatural Language Processing
N
N. L. Roux
Microsoft Research, Redmond, WA 98052, USA