CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current RAG evaluation frameworks lack a unified assessment of complex reasoning, uncertainty-based refusal, precise citation, and layout-aware understanding in structured-document scenarios. To address this gap, we introduce CReSt, a bilingual (English/Korean) comprehensive RAG benchmark specifically designed for structured documents, comprising 2,245 human-annotated samples. We propose a systematic evaluation framework that jointly assesses structure awareness, uncertainty identification, and multi-step reasoning capabilities. Furthermore, we develop a fine-grained, human-in-the-loop automated evaluation methodology with multidimensional metrics, including citation accuracy, refusal reasonableness, and layout comprehension. Experimental results reveal a severe performance imbalance across dimensions among state-of-the-art LLMs, exposing fundamental bottlenecks in structured-document RAG. Both the benchmark dataset and evaluation code are publicly released to advance community research.

📝 Abstract
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: https://github.com/UpstageAI/CReSt.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in practical RAG scenarios with complex reasoning
Assessing model capabilities in handling structured document layouts
Providing a unified benchmark for holistic RAG performance evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified benchmark for complex RAG evaluation
Human-annotated multilingual structured document dataset
Tailored methodology for holistic performance assessment
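To make the multidimensional evaluation idea concrete, the sketch below aggregates per-sample scores along the dimensions named in the summary (citation accuracy, refusal reasonableness, layout comprehension) into per-dimension means and a macro-averaged overall score. The aggregation scheme and function names are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch of multidimensional score aggregation.
# The dimension names follow the metrics described above; the
# macro-averaging scheme is an assumption, not CReSt's actual formula.

def aggregate_scores(per_sample_scores):
    """Average each dimension across samples, then macro-average the dimensions."""
    dims = per_sample_scores[0].keys()
    dim_means = {
        d: sum(s[d] for s in per_sample_scores) / len(per_sample_scores)
        for d in dims
    }
    overall = sum(dim_means.values()) / len(dim_means)
    return dim_means, overall

# Toy per-sample scores for two annotated examples (made-up numbers).
scores = [
    {"citation_accuracy": 0.8, "refusal_reasonableness": 0.5, "layout_comprehension": 0.4},
    {"citation_accuracy": 0.6, "refusal_reasonableness": 0.7, "layout_comprehension": 0.6},
]
dim_means, overall = aggregate_scores(scores)
```

Reporting per-dimension means alongside the overall average is what lets a benchmark expose the kind of performance imbalance the summary describes, rather than hiding weak dimensions behind a single number.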
Minsoo Khang
Upstage AI
OCR, Intelligent Document Parsing
Sangjun Park
Upstage AI
Teakgyu Hong
Upstage AI
Dawoon Jung
Upstage AI