🤖 AI Summary
Existing chunking strategies for retrieval-augmented generation (RAG) are designed mainly for unstructured text and handle tabular data poorly. This work proposes a Structure-aware Tabular Chunking (STC) framework that builds table structure into the RAG chunking step. STC constructs a hierarchical Row Tree over row-level units, encoding each row as a key-value block, then applies token-constrained splitting aligned with structural boundaries followed by overlap-free greedy merging. The result is a set of dense, non-overlapping chunks that preserve the relationships between fields within a row. On the MAUD dataset, STC reduces chunk count by up to 40% relative to a standard recursive baseline and by up to 56% relative to a key-value-based baseline. In retrieval benchmarks, MRR rises from 0.3576 to 0.5945 in a hybrid setting, and Recall@1 rises from 0.366 to 0.754 with BM25-only retrieval.
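For readers unfamiliar with the two retrieval metrics quoted above, the following is a minimal sketch of how MRR and Recall@k are commonly computed. It assumes exactly one relevant chunk per query, which may differ from the paper's evaluation protocol; the function names and the toy rankings are illustrative and not taken from the paper.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant chunk (ranks start at 1), or 0.0 if it is absent."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[str]], relevant: Sequence[str]) -> float:
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in zip(runs, relevant)) / len(runs)


def recall_at_k(runs: Sequence[Sequence[str]], relevant: Sequence[str], k: int = 1) -> float:
    """Fraction of queries whose relevant chunk appears in the top k results.

    Assumption: one relevant chunk per query, so recall per query is 0 or 1.
    """
    hits = sum(1 for r, g in zip(runs, relevant) if g in r[:k])
    return hits / len(runs)


# Toy example: three queries, each with relevant chunk "a" (made-up data).
runs = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant = ["a", "a", "a"]
mrr = mean_reciprocal_rank(runs, relevant)
r_at_1 = recall_at_k(runs, relevant, k=1)
```

Under this reading, a Recall@1 of 0.754 would mean that the top-ranked chunk is the relevant one for roughly three out of four queries.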
📝 Abstract
Tabular documents such as CSV and Excel files are widely used in enterprise data pipelines, yet existing chunking strategies for retrieval-augmented generation (RAG) are primarily designed for unstructured text and do not account for tabular structure. We propose a structure-aware tabular chunking (STC) framework that operates on row-level units by constructing a hierarchical Row Tree representation in which each row is encoded as a key-value block. STC performs token-constrained splitting aligned with structural boundaries and applies overlap-free greedy merging to produce dense, non-overlapping chunks. This design preserves semantic relationships between fields within a row while improving token utilization and reducing fragmentation. In evaluations on the MAUD dataset, STC reduces chunk count by up to 40% and 56% compared to standard recursive and key-value-based baselines, respectively, while also improving processing efficiency. In retrieval benchmarks, STC improves MRR from 0.3576 to 0.5945 in a hybrid setting and increases Recall@1 from 0.366 to 0.754 in BM25-only retrieval. These results indicate that preserving structure during chunking improves retrieval performance for RAG over tabular data.
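The pipeline described above can be illustrated with a rough sketch. This is not the authors' implementation: it flattens the hierarchical Row Tree into a simple list of per-row key-value lines, and it uses a hypothetical whitespace word count in place of a real tokenizer. All function names and the sample rows are invented for illustration.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    """Hypothetical token counter: whitespace-separated words stand in for a tokenizer."""
    return len(text.split())


def row_to_fields(row: Dict[str, str]) -> List[str]:
    """Encode one table row as 'key: value' lines, one per cell."""
    return [f"{key}: {value}" for key, value in row.items()]


def split_row(fields: List[str], max_tokens: int) -> List[str]:
    """Split a row's key-value block at field boundaries so each piece fits the budget.

    A single field larger than the budget is kept whole; this sketch never cuts a cell.
    """
    pieces, current, used = [], [], 0
    for field in fields:
        cost = count_tokens(field)
        if current and used + cost > max_tokens:
            pieces.append("\n".join(current))
            current, used = [], 0
        current.append(field)
        used += cost
    if current:
        pieces.append("\n".join(current))
    return pieces


def greedy_merge(pieces: List[str], max_tokens: int) -> List[str]:
    """Pack consecutive pieces into chunks with no overlap between chunks.

    A chunk is closed as soon as the next piece would exceed the token budget.
    """
    chunks, current, used = [], [], 0
    for piece in pieces:
        cost = count_tokens(piece)
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(piece)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_table(rows: List[Dict[str, str]], max_tokens: int) -> List[str]:
    """Split each row on structural boundaries, then greedily merge the pieces."""
    pieces: List[str] = []
    for row in rows:
        pieces.extend(split_row(row_to_fields(row), max_tokens))
    return greedy_merge(pieces, max_tokens)


# Made-up rows for illustration only; each row holds 5 whitespace tokens.
rows = [
    {"id": "1", "clause": "no shop"},
    {"id": "2", "clause": "fiduciary out"},
]
chunks = chunk_table(rows, max_tokens=8)
```

Rows read with Python's `csv.DictReader` already arrive as dictionaries and could be passed to `chunk_table` directly. Because chunks are closed rather than overlapped, every field appears in exactly one chunk, which is the property that keeps the chunk count low when the budget is large enough to hold several rows.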