SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a persistent challenge in linear text segmentation: existing methods often fail to identify semantically coherent paragraph boundaries accurately, limiting the performance of downstream NLP tasks. The authors reformulate the problem as a Next Sentence Prediction (NSP) task, detecting topic boundaries implicitly by modeling inter-sentence continuity without requiring explicit topic labels. The core contributions are a label-agnostic NSP framework, a segmentation-aware loss function, and a hard negative sampling strategy, which together eliminate reliance on task-specific supervision signals. Evaluated on the CitiLink-Minutes and WikiSection datasets, the proposed model achieves B-$F_1$ scores of 0.79 and 0.65, respectively, substantially outperforming existing reproducible baselines.
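The core idea, as described in the summary, can be sketched as follows: score each adjacent sentence pair for topical continuity and place a segment boundary wherever the score drops below a threshold. This is a minimal illustrative sketch, not the paper's implementation: a real system would use a trained NSP classifier as the scorer, whereas here a simple bag-of-words cosine similarity stands in for it, and the threshold value is an arbitrary assumption.

```python
# Sketch of NSP-style linear segmentation: cut wherever inter-sentence
# continuity falls below a threshold. The continuity scorer below is a
# bag-of-words stand-in for a trained NSP model, used only for illustration.
from collections import Counter
import math

def _tokens(text: str) -> Counter:
    """Lowercased bag-of-words with trailing punctuation stripped."""
    return Counter(w.strip(".,;:!?") for w in text.lower().split())

def continuity_score(prev: str, nxt: str) -> float:
    """Stand-in for an NSP model's P(next sentence continues the topic)."""
    a, b = _tokens(prev), _tokens(nxt)
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def segment(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Greedy left-to-right segmentation: cut when continuity drops below threshold."""
    segments, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if continuity_score(prev, nxt) < threshold:
            segments.append(current)
            current = []
        current.append(nxt)
    segments.append(current)
    return segments

sents = [
    "The council approved the new budget.",
    "The budget allocates funds for road repairs.",
    "Turning to education, the school board presented enrollment figures.",
    "Enrollment figures show a rise in kindergarten admissions.",
]
print(segment(sents))  # two segments: budget topic, then education topic
```

The decision rule is the same regardless of the scorer; swapping the bag-of-words function for a fine-tuned NSP head is what turns this sketch into the kind of model the paper proposes.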

📝 Abstract
Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets, CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-$F_1$ of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-$F_1$ of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
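The abstract's "harder negative sampling" can be illustrated with a small sketch. Under a plausible reading (the paper's exact scheme is not specified here), positive NSP pairs are consecutive sentences within the same segment, while hard negatives pair a sentence with one drawn from a *nearby* segment of the same document, which is harder to distinguish than a random sentence from an unrelated document. The function name, the `window` parameter, and the one-negative-per-segment choice are all assumptions for illustration.

```python
# Sketch of hard negative sampling for NSP-style segmentation training.
# Label 1 = "next sentence continues the topic", label 0 = hard negative
# drawn from a neighboring segment within `window` segments.
import random

def make_nsp_pairs(segments: list[list[str]], window: int = 1, seed: int = 0):
    """Return (sentence_a, sentence_b, label) triples for NSP training."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    pairs = []
    for i, seg in enumerate(segments):
        # Positives: consecutive sentence pairs inside one segment.
        for a, b in zip(seg, seg[1:]):
            pairs.append((a, b, 1))
        # Hard negatives: the segment-final sentence paired with a sentence
        # sampled from an adjacent segment (topically close, but a boundary).
        neighbors = [s for j, other in enumerate(segments)
                     if j != i and abs(j - i) <= window for s in other]
        if neighbors:
            pairs.append((seg[-1], rng.choice(neighbors), 0))
    return pairs
```

Sampling negatives from adjacent segments rather than random documents forces the classifier to rely on topical continuity instead of surface-level domain cues, which is the stated motivation for harder negatives.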
Problem

Research questions and friction points this paper is trying to address.

linear text segmentation
topic boundary detection
discourse structure
next sentence prediction
coherence modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

next sentence prediction
text segmentation
discourse continuity
label-agnostic learning
negative sampling