ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of Visually Rich Document Understanding (VRDU) rely predominantly on intact, well-structured documents, so they do not adequately assess how well multimodal large language models (MLLMs) reconstruct meaning when content is fragmented. This work proposes ShredBench, the first automated benchmark specifically designed for fragmented document reconstruction. It uses a Markdown-based pipeline to generate samples across four document types (Chinese, English, code, and tables) at three levels of fragmentation granularity, and introduces standardized metrics such as Normalized Edit Distance (NED) for evaluation. Experimental results show that leading MLLMs perform well on intact documents but degrade substantially in fragmented settings, with particular deficiencies in fine-grained cross-modal continuity reasoning. Because the pipeline accepts new text sources, ShredBench can be extended flexibly, which mitigates the risk of data contamination.
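To make the NED metric mentioned above concrete, here is a minimal sketch of one common formulation: Levenshtein distance divided by the length of the longer string. This is an assumption for illustration, not necessarily the exact definition used by ShredBench; the function names `levenshtein`, `normalized_edit_distance`, and `ned_similarity` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Assumed normalization: divide by the longer length, giving a value in [0, 1]."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest


def ned_similarity(prediction: str, reference: str) -> float:
    """Similarity form (1 - distance), where higher is better."""
    return 1.0 - normalized_edit_distance(prediction, reference)
```

Benchmarks differ on whether they report the distance (lower is better) or the similarity form (higher is better), so the convention should be checked against the paper before comparing scores.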

📝 Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The pipeline protects evaluation validity by allowing recent or unseen textual sources to be integrated flexibly, which prevents training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) at three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations of state-of-the-art MLLMs reveal a significant performance gap: the models are effective on intact documents; however, once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.
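As a rough illustration of the three fragmentation granularities, the sketch below cuts a rendered page into 8, 12, or 16 rectangular pieces and shuffles their order. The abstract does not specify how fragments are shaped or arranged, so the rectangular grid, the `GRID_FOR_PIECES` mapping, and the function names are all assumptions made for this example.

```python
import random

# Hypothetical mapping from piece count to (rows, cols); the paper may cut differently.
GRID_FOR_PIECES = {8: (2, 4), 12: (3, 4), 16: (4, 4)}


def grid_boxes(width: int, height: int, rows: int, cols: int):
    """Return (left, top, right, bottom) boxes that tile the page exactly."""
    boxes = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


def shred_layout(width: int, height: int, pieces: int, seed: int = 0):
    """Return shuffled (original_index, box) pairs for one shredded page."""
    rows, cols = GRID_FOR_PIECES[pieces]
    boxes = grid_boxes(width, height, rows, cols)
    order = list(range(len(boxes)))
    random.Random(seed).shuffle(order)  # seeded so a sample can be regenerated
    return [(index, boxes[index]) for index in order]
```

Each box can be passed to an image library's crop routine to cut the rendered page, and the retained `original_index` values give the ground-truth ordering against which a model's reconstruction is scored.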

Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs

Document Reconstruction

Semantic Reasoning

Visually Rich Document Understanding

Shredded Documents

Innovation

Methods, ideas, or system contributions that make the work stand out.

ShredBench

Multimodal LLMs

Document Reconstruction