🤖 AI Summary
This study addresses the challenging problem of automatically splitting real-world multi-page heterogeneous document packets, which often exhibit complex configurations such as concatenated, misordered, and interleaved pages. We formally define the document packet splitting task for the first time, encompassing boundary detection, document type classification, and page order recovery. To facilitate research in this area, we introduce DocSplit, the first comprehensive benchmark for this task, comprising five datasets that span diverse document types, layouts, and multimodal settings, along with standardized evaluation metrics tailored to these complex scenarios. Because the task spans both visual and textual signals, we conduct extensive experiments evaluating multimodal large language models that integrate the two modalities end to end. The results reveal that existing models perform substantially below desired levels on this task, underscoring the need for further innovation. DocSplit thus establishes a critical foundation for advancing document intelligence in domains such as legal, financial, and healthcare applications.
📝 Abstract
Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into its individual constituent documents, remains largely unaddressed. We present the first comprehensive benchmark, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models' ability to handle complex document splitting tasks. The DocSplit benchmark datasets and evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.
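To make the task definition concrete, here is a minimal sketch of what a packet-splitting prediction and a segment-level score might look like. The abstract does not specify the paper's data format or metrics, so the `Segment` structure and the exact-match F1 below are illustrative assumptions, not the benchmark's actual evaluation protocol.

```python
# Hypothetical sketch of the packet splitting task's output structure and a
# simple segment-level metric. Names and scoring are illustrative only; the
# paper's actual formats and metrics are not reproduced here.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    doc_type: str           # e.g. "invoice", "contract" (hypothetical labels)
    pages: Tuple[int, ...]  # packet page indices in recovered reading order


def segment_f1(pred: List[Segment], gold: List[Segment]) -> float:
    """F1 over exact segment matches: a predicted segment counts only if its
    type, page membership, and page order all match a gold segment, which
    jointly tests boundary detection, classification, and order recovery."""
    if not pred or not gold:
        return 0.0
    matched = len(set(pred) & set(gold))
    precision = matched / len(pred)
    recall = matched / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A 5-page packet whose two documents are interleaved and partly misordered:
gold = [
    Segment("invoice", (0, 2)),      # invoice pages interleaved with contract
    Segment("contract", (1, 4, 3)),  # page 4 precedes page 3 after reordering
]
pred = [
    Segment("invoice", (0, 2)),
    Segment("contract", (1, 3, 4)),  # correct pages, wrong order -> no match
]
print(segment_f1(pred, gold))  # 0.5
```

Scoring whole segments, rather than individual boundaries, is what makes the combined task hard: a model can find every page of a document and still fail the segment if it cannot recover the original page order.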