Rule-Based Approaches to Atomic Sentence Extraction

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses complex sentences that pack multiple semantic units into one sentence, which degrades performance in tasks such as information retrieval and automated reasoning and motivates decomposing them into atomic sentences that each express a single proposition. The work analyzes how specific syntactic constructions, including relative clauses, adverbial clauses, coordination, and passive voice, affect atomic sentence extraction. The authors implement an interpretable rule-based system grounded in dependency parsing, using spaCy to extract subject-verb-object triples and subordinate clauses. Evaluated against 100 gold-standard atomic sentence sets built from the WikiSplit dataset, the system achieves ROUGE-1 F1 = 0.6714 and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Because the rules are explicit, failures can be attributed to particular constructions; overall, the approach is reasonably accurate but sensitive to syntactic complexity.
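
To make the idea of dependency-based splitting concrete, here is a minimal sketch of a single rule: splitting coordinated predicates that share one subject. It is not the authors' rule set. The `Token` class, the helper names, and the hand-annotated parse are all assumptions made for illustration; the dependency labels (`nsubj`, `dobj`, `conj`, `cc`, `ROOT`) follow the scheme used by spaCy's English models, but the parse is written by hand so the example runs without any parser installed.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str   # dependency label, e.g. "nsubj", "dobj", "conj"
    head: int  # index of the head token (the ROOT points to itself)


def subtree(tokens, i):
    """Indices of token i and all of its descendants, in sentence order."""
    out = [i]
    for j, tok in enumerate(tokens):
        if j != i and tok.head == i:
            out.extend(subtree(tokens, j))
    return sorted(out)


def child(tokens, head, dep):
    """Index of the first child of `head` with label `dep`, or None."""
    for j, tok in enumerate(tokens):
        if j != head and tok.head == head and tok.dep == dep:
            return j
    return None


def span_text(tokens, i):
    return " ".join(tokens[j].text for j in subtree(tokens, i))


def split_coordinated_predicates(tokens):
    """Emit one subject-verb-object sentence per coordinated verb."""
    root = next(i for i, t in enumerate(tokens) if t.dep == "ROOT")
    predicates = [root] + [
        i for i, t in enumerate(tokens) if t.dep == "conj" and t.head == root
    ]
    root_subj = child(tokens, root, "nsubj")
    atomic = []
    for p in predicates:
        subj = child(tokens, p, "nsubj")
        if subj is None:
            subj = root_subj  # a conjoined verb inherits the root's subject
        obj = child(tokens, p, "dobj")
        parts = []
        if subj is not None:
            parts.append(span_text(tokens, subj))
        parts.append(tokens[p].text)
        if obj is not None:
            parts.append(span_text(tokens, obj))
        atomic.append(" ".join(parts) + ".")
    return atomic


# Hand-annotated parse (an assumption, not parser output) of:
# "The committee reviewed the proposal and approved the budget."
EXAMPLE = [
    Token("The", "det", 1),
    Token("committee", "nsubj", 2),
    Token("reviewed", "ROOT", 2),
    Token("the", "det", 4),
    Token("proposal", "dobj", 2),
    Token("and", "cc", 2),
    Token("approved", "conj", 2),
    Token("the", "det", 8),
    Token("budget", "dobj", 6),
    Token(".", "punct", 2),
]

for sentence in split_coordinated_predicates(EXAMPLE):
    print(sentence)
```

The sketch handles only active-voice verbs with direct objects. Relative clauses, appositions, adverbial clauses, and passives would each need their own rule, which is consistent with the summary's point that these constructions are where rule-based extraction struggles.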

📝 Abstract
Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
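
For readers unfamiliar with the reported metric, the following is a simplified sketch of how a ROUGE-1 F1 score is computed from unigram overlap. It assumes lowercase whitespace tokenization and no stemming, so it will not reproduce the paper's exact numbers; the function name `rouge1_f1` and the example strings are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a system output and a gold sentence."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


system_output = "the committee approved the budget"
gold_sentence = "the committee approved the annual budget"
print(round(rouge1_f1(system_output, gold_sentence), 4))
```

In this example every candidate word appears in the reference (precision 5/5) while the reference has one extra word (recall 5/6), giving F1 = 10/11. ROUGE-2 applies the same calculation to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.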
Problem

Research questions and friction points this paper is trying to address.

atomic sentence extraction
complex sentence decomposition
syntactic complexity
rule-based extraction
clause structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

atomic sentence extraction
rule-based approach
syntactic complexity
dependency parsing
sentence decomposition