Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation

📅 2025-11-29
🤖 AI Summary
Traditional fixed-length and recursive chunking methods in retrieval-augmented generation (RAG) often disrupt semantic coherence, while the impact of semantic chunking on generation quality remains underexplored. This paper introduces two domain-aware semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC), and investigates how semantic chunking affects both retrieval accuracy and generation quality, including out-of-domain generalization. The methods are trained on PubMed data with three different embedding models and evaluated by augmenting PubMedQA with full-text PubMed Central (PMC) articles. Experiments report a 24x improvement in Mean Reciprocal Rank (MRR) with PSC, higher Hits@k, response-time comparisons with common chunking libraries, and strong out-of-domain generation performance across multiple datasets. The authors conclude that their semantic chunkers, especially PSC, consistently deliver superior performance.
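
To make the contrast between fixed-length and semantic chunking concrete, here is a minimal sketch. It is not PSC or MFC, whose details are not given on this page. It assumes a toy bag-of-words embedding and a hand-picked similarity threshold, and the helper names (`embed`, `cosine`, `fixed_length_chunks`, `semantic_chunks`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words 'embedding' (assumption: a real system would use a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fixed_length_chunks(sentences, size):
    """Baseline: group every `size` sentences, ignoring their content."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def semantic_chunks(sentences, threshold):
    """Start a new chunk whenever adjacent sentences fall below `threshold` similarity."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) >= threshold:
            chunks[-1].append(cur)   # same topic: extend the current chunk
        else:
            chunks.append([cur])     # topic shift: open a new chunk
    return chunks


sentences = [
    "Aspirin reduces fever in adults.",
    "Aspirin also reduces pain in adults.",
    "Gene expression varies across tissues.",
    "Tissues differ in gene expression levels.",
]

# The fixed-length splitter cuts through the second topic,
# while the similarity-based splitter breaks at the topic shift.
print(fixed_length_chunks(sentences, size=3))
print(semantic_chunks(sentences, threshold=0.3))
```

With these four sentences, the fixed-length baseline places the first gene-expression sentence in the aspirin chunk, whereas the threshold-based splitter keeps each topic together. The threshold of 0.3 is an illustrative value, not one taken from the paper.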

📝 Abstract
Document chunking is a crucial component of Retrieval-Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed-length and recursive splitters often produce arbitrary, incoherent segments that fail to preserve semantic structure. Although semantic chunking has gained traction, its influence on generation quality remains underexplored. This paper introduces two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC), trained on PubMed data using three different embedding models. We further present an evaluation framework that measures the effect of chunking on both retrieval and generation by augmenting PubMedQA with full-text PubMed Central articles. Our results show substantial retrieval improvements in MRR (24x with PSC) and higher Hits@k on PubMedQA. We provide a comprehensive analysis, including statistical significance and response-time comparisons with common chunking libraries. Despite being trained on a single domain, PSC and MFC also generalize well, achieving strong out-of-domain generation performance across multiple datasets. Overall, our findings confirm that our semantic chunkers, especially PSC, consistently deliver superior performance.
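
The retrieval metrics named in the abstract, MRR and Hits@k, can be sketched as follows. This is a minimal illustration that assumes exactly one relevant chunk per query. The paper's exact evaluation protocol is not given here, and the function names and chunk ids are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over all (ranked_ids, relevant_id) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def hits_at_k(runs, k):
    """Fraction of queries whose relevant chunk appears in the top k results."""
    return sum(rel in ranked[:k] for ranked, rel in runs) / len(runs)


# Hypothetical retrieval results: (ranked chunk ids, id of the relevant chunk).
runs = [
    (["c3", "c1", "c7"], "c1"),        # relevant chunk at rank 2
    (["c4", "c2", "c9"], "c4"),        # relevant chunk at rank 1
    (["c5", "c6", "c8"], "c0"),        # relevant chunk not retrieved
    (["c2", "c5", "c9", "c1"], "c1"),  # relevant chunk at rank 4
]

print(mean_reciprocal_rank(runs))  # (0.5 + 1.0 + 0.0 + 0.25) / 4 = 0.4375
print(hits_at_k(runs, k=1))        # 0.25
print(hits_at_k(runs, k=3))        # 0.5
```

Under this definition, a 24x gain in MRR means the relevant chunk moves much closer to the top of the ranking on average, while Hits@k only records whether it appears within the first k results.
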
Problem

Research questions and friction points this paper is trying to address.

Improves document chunking for RAG systems
Evaluates chunking impact on retrieval and generation
Introduces domain-aware semantic segmentation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projected Similarity Chunking for semantic segmentation
Metric Fusion Chunking trained on PubMed data
Evaluation framework using PubMedQA and full-text articles
Aparajitha Allamraju
International Institute of Information Technology, Hyderabad, India
Maitreya Prafulla Chitale
International Institute of Information Technology, Hyderabad, India
Hiranmai Sri Adibhatla
International Institute of Information Technology, Hyderabad, India
Rahul Mishra
Assistant Professor, IIIT Hyderabad, India
Deep Learning, Natural Language Processing, Information Retrieval
Manish Shrivastava
International Institute of Information Technology Hyderabad
Natural Language Processing, Machine Learning, Machine Translation, Cross Lingual IR, Multilingual Question Answering