fPLSA: Learning Semantic Structures in Document Collections Using Foundation Models

📅 2024-10-07

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses semantic structure modeling for document collections. It proposes an iterative Probabilistic Latent Semantic Analysis (PLSA) method integrated with foundation models. The approach leverages large language model (LLM) embeddings to perform context-aware, iterative clustering and labeling of text segments within document-level contexts, marking the first integration of foundation model embeddings into the PLSA framework. The resulting hierarchical semantic labels are both interpretable and actionable, serving two roles: (1) encoding structural semantics and (2) guiding hierarchical prompt-based sampling. Experiments on story generation, mathematical reasoning, and multi-step reasoning tasks show that the method outperforms existing labeling approaches in text reconstruction quality, answer accuracy (hit rate), and output diversity.

📝 Abstract

Humans have the ability to learn new tasks by inferring high-level concepts from existing solutions, then manipulating these concepts in lieu of the raw data. Can we automate this process by deriving latent semantic structures in a document collection using foundation models? We introduce fPLSA, a foundation-model-based Probabilistic Latent Semantic Analysis (PLSA) method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fPLSA tags help reconstruct the original texts better than existing tagging methods. Moreover, when used for hierarchical sampling, fPLSA produces more diverse outputs with a higher likelihood of hitting the correct answer than direct sampling and hierarchical sampling with existing tagging methods.
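
For background, classical PLSA models each document d as a mixture over latent classes z, with P(w|d) = sum over z of P(z|d) P(w|z), and fits the two distributions with EM. The sketch below is that textbook procedure on a word-count matrix. It is not the fPLSA procedure, whose details this summary does not give; the function name `plsa`, the toy counts, and the iteration budget are all illustrative assumptions.

```python
import random


def plsa(counts, n_topics, n_iters=50, seed=0):
    """Textbook PLSA via EM. counts[d][w] is the count of word w in document d.

    Returns (p_z_d, p_w_z): P(z|d) per document and P(w|z) per latent class.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        if total == 0:  # degenerate row: fall back to a uniform distribution
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Random positive initialisation of both distributions.
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iters):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total == 0:
                    continue
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    resp = counts[d][w] * joint[z] / total
                    new_z_d[d][z] += resp
                    new_w_z[z][w] += resp
        # M-step: renormalise expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


# Toy usage: four documents over a four-word vocabulary (made-up counts).
toy_counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 2], [0, 0, 3, 6]]
doc_topics, topic_words = plsa(toy_counts, n_topics=2)
```

Per the abstract, fPLSA works with document segments and tags produced with a foundation model rather than with raw word counts, so this block only illustrates the latent-variable framing that the name refers to.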

Problem

Research questions and friction points this paper is trying to address.

Inducing high-level semantic structures from documents

Modeling latent document structures using clustering

Improving hierarchical text sampling for correct solutions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Clusters and tags document segments iteratively

Models latent structure using document-level contexts

Enables hierarchical sampling for new text generation
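
One plausible reading of hierarchical sampling is a two-stage process: first sample a sequence of tags as a high-level plan, then write one text segment per tag. The sketch below illustrates that idea only. The first-order Markov chain over tags is an assumption made here for simplicity and is not stated in the abstract, `generate_segment` is a hypothetical stand-in for a foundation-model call, and the example tags are invented.

```python
import random
from collections import Counter, defaultdict

START, END = "<start>", "<end>"


def fit_tag_transitions(tag_sequences):
    """Count tag-to-tag transitions, including start and end markers."""
    transitions = defaultdict(Counter)
    for tags in tag_sequences:
        path = [START, *tags, END]
        for prev, nxt in zip(path, path[1:]):
            transitions[prev][nxt] += 1
    return transitions


def sample_tag_sequence(transitions, rng, max_len=20):
    """Stage 1: sample a high-level plan as a sequence of tags."""
    tags, current = [], START
    while len(tags) < max_len:
        options = transitions[current]
        if not options:  # no observed continuation: stop early
            break
        current = rng.choices(list(options), weights=list(options.values()))[0]
        if current == END:
            break
        tags.append(current)
    return tags


def generate_segment(tag, previous_segments):
    """Hypothetical placeholder for a model call conditioned on a tag."""
    return f"[segment written for tag '{tag}']"


def hierarchical_sample(transitions, rng):
    """Stage 2: write one segment per sampled tag, in order."""
    segments = []
    for tag in sample_tag_sequence(transitions, rng):
        segments.append(generate_segment(tag, segments))
    return segments


# Toy usage with invented tag sequences for two already-tagged solutions.
tagged_docs = [
    ["set up equation", "simplify", "state answer"],
    ["set up equation", "substitute", "state answer"],
]
model = fit_tag_transitions(tagged_docs)
sample = hierarchical_sample(model, random.Random(0))
```

Sampling the plan before the text is what lets different tag sequences lead to structurally different outputs, which is consistent with the diversity gain the abstract reports.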

🔎 Similar Papers

AMR-RE: Abstract Meaning Representations for Retrieval-Based In-Context Learning in Relation Extraction