🤖 AI Summary
Semantic segmentation of complex structured text—containing tables, code blocks, placeholders, and other non-linguistic elements—remains challenging for conventional sentence- or paragraph-level approaches, which fail to model such heterogeneous content. Method: We propose a token-level segmentation framework based on Reinforcement Learning with Verifiable Rewards (RLVR). Instead of generating full segments, the model emits only paragraph-start tokens; original-text localization then reconstructs segment content, mitigating hallucination by avoiding explicit token generation. A reward function jointly optimizes reconstruction fidelity and semantic alignment, while sequence perturbation generates intermediate candidate solutions to alleviate entropy collapse. Results: Our 1.7B-parameter model outperforms few-shot prompting of much larger language models on LLM prompt segmentation, and RLVR training with the designed reward yields better accuracy, cross-domain generalization, and inference efficiency than supervised fine-tuning baselines.
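The sequence-perturbation idea above can be sketched in miniature. This is a hypothetical illustration, not the paper's actual scheme: a fraction of the segment-start positions in a candidate solution is shifted by a small random offset to create an intermediate candidate. The function name `perturb_boundaries` and all parameters (`frac`, `max_shift`) are assumptions for the sketch.

```python
import random

def perturb_boundaries(boundaries, text_len, frac=0.25, max_shift=5, rng=None):
    """Create an intermediate candidate by shifting a fraction of
    segment-start positions by a small random offset.

    This is only an illustrative stand-in for the perturbation step
    described in the summary; the actual method may perturb tokens
    rather than character offsets.
    """
    rng = rng or random.Random(0)
    # Perturb at least one boundary, up to `frac` of them.
    k = max(1, int(len(boundaries) * frac))
    chosen = set(rng.sample(range(len(boundaries)), k))
    out = []
    for i, b in enumerate(boundaries):
        if i in chosen:
            # Shift within the text, clamped to valid positions.
            b = min(max(0, b + rng.randint(-max_shift, max_shift)), text_len - 1)
        out.append(b)
    return sorted(out)
```

Intermediate candidates produced this way stay close to the model's own outputs while reintroducing diversity, which is the stated purpose of the perturbation step.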
📝 Abstract
As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need to segment text into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating the complete content of each segment, it generates only a sequence of segment-starting tokens and reconstructs the complete content by locating these tokens within the original text, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model to this output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) using a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of the generated segment sequences, creating stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus our evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
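The reconstruction-by-localization step described in the abstract can be sketched as follows. This is a minimal sketch under assumptions, not the paper's implementation: given the ordered segment-start strings the model emits, each is located in the original text (searching forward from the previous match) and the segments are recovered as slices between consecutive start positions. The function name `reconstruct_segments` is hypothetical.

```python
def reconstruct_segments(original, start_tokens):
    """Recover full segment contents from generated start tokens.

    Each start token is located in `original` at or after the previous
    match; segments are the slices between consecutive start positions.
    Returns None if any start token cannot be found, which lets a
    verifiable reward penalize unlocatable (hallucinated) outputs.
    """
    positions = []
    cursor = 0
    for tok in start_tokens:
        idx = original.find(tok, cursor)
        if idx == -1:
            return None  # start token not found in the original text
        positions.append(idx)
        cursor = idx + len(tok)
    segments = []
    for i, start in enumerate(positions):
        end = positions[i + 1] if i + 1 < len(positions) else len(original)
        segments.append(original[start:end])
    return segments
```

Because the model emits only short start tokens and the segments are copied verbatim from the source, generation cost scales with the number of boundaries rather than document length, which is the efficiency argument made in the abstract.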