From Documents to Segments: A Contextual Reformulation for Topic Assignment

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional topic models assign a single topic to each document, which misrepresents multi-topic texts and leads to topic contamination and reduced interpretability. This work proposes segment-based topic allocation (SBTA), which moves topic assignment from the document level to short, semantically coherent text segments. To support evaluation, the authors construct the SemEval-STM dataset, in which documents are segmented by large language models and then refined by humans, and introduce a segment-level word intrusion task for fine-grained human evaluation. Across multiple topic models and evaluation metrics, SBTA improves clustering quality and interpretability.

📝 Abstract

Traditional topic modeling assigns a single topic to each document. In practice, however, many real-world documents, such as product reviews or open-ended survey responses, contain multiple distinct topics. This mismatch often leads to topic contamination, where unrelated themes are merged into a single topic, making it difficult to identify documents that truly focus on a specific subject. We address this issue by introducing segment-based topic allocation (SBTA), a reformulation of topic modeling that assigns topics not to entire documents, but to segments: short, coherent spans of text that each express a single theme. By modeling topical structure at the segment level, our approach yields cleaner and more interpretable topics and better supports analysis of multi-theme documents. To support systematic evaluation, we construct SemEval-STM, a new dataset inspired by aspect-based sentiment analysis. Documents are first decomposed into topical segments using large language models (LLMs), followed by human refinement to ensure segment quality. We also propose a segment-level extension of the word intrusion task, enabling human evaluation of topical coherence at the granularity where topics are actually assigned. Across multiple models and evaluation metrics, we show that SBTA improves clustering quality and interpretability. Overall, this work provides a practical, scalable framework for fine-grained topic analysis in heterogeneous text corpora where documents naturally span multiple topics. URL: https://huggingface.co/datasets/LG-AI-Research/SemEval-STM
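
To make the document-level versus segment-level contrast concrete, here is a minimal Python sketch. It is not the authors' method. The regex sentence splitter stands in for the LLM-based segmentation with human refinement described in the abstract, and the keyword-overlap scorer stands in for a real topic model. The seed vocabularies, function names, and the sample review are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical seed vocabularies standing in for learned topics (assumption).
TOPICS = {
    "battery": {"battery", "charge", "charging", "power"},
    "screen": {"screen", "display", "brightness", "pixels"},
    "shipping": {"shipping", "delivery", "arrived", "package"},
}


def split_into_segments(document):
    """Naive stand-in for LLM segmentation: split after ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def assign_topic(text):
    """Return the topic whose seed words occur most often in text, or None."""
    counts = Counter()
    for token in tokenize(text):
        for topic, words in TOPICS.items():
            if token in words:
                counts[topic] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def assign_by_segment(document):
    """Assign one topic per segment instead of one per document."""
    return [(seg, assign_topic(seg)) for seg in split_into_segments(document)]


# Hypothetical multi-theme review used only for illustration.
review = (
    "The battery lasts two days on one charge. "
    "The screen is dim outdoors. "
    "Delivery took three weeks."
)

document_level = assign_topic(review)      # a single label for the whole review
segment_level = assign_by_segment(review)  # one label per segment
```

In this toy case the document-level call returns only `battery`, so the screen and shipping themes are hidden under that label, while the segment-level call labels the three sentences `battery`, `screen`, and `shipping` respectively.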

Problem

Research questions and friction points this paper is trying to address.

topic modeling

topic contamination

multi-theme documents

document segmentation

topic assignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

segment-based topic modeling

topic coherence

multi-theme documents