🤖 AI Summary
Current large language models (LLMs) struggle to accurately extract fragmented information from complex real-world documents (e.g., academic papers, reports) and dynamically construct structured tables, often producing disorganized, non-auditable paragraph-style outputs.
Method: We introduce AOE, the first bilingual benchmark for text-to-table conversion with dynamic schema generation, spanning three domains and 11 tasks and requiring models to adaptively infer context-sensitive table schemas from their inputs. AOE departs from conventional fixed-schema paradigms by incorporating inputs of varying length, traceable reasoning steps, and deep knowledge integration, supported by diverse human-crafted queries and gold-standard structured answers.
Contribution/Results: Extensive evaluation reveals significant performance gaps across state-of-the-art open- and closed-source LLMs on AOE, exposing fundamental weaknesses in structured reasoning and information organization. These findings highlight critical bottlenecks and provide concrete directions for advancing robust, schema-agnostic table generation capabilities.
📝 Abstract
With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schemas and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. In experiments, we evaluate both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggle significantly. The benchmark is available at https://huggingface.co/datasets/tianyumyum/AOE.
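To make the fixed-schema vs. dynamic-schema distinction concrete, here is a minimal, hypothetical sketch. The query, records, and schema-inference logic below are illustrative assumptions, not part of AOE or its evaluation pipeline: a fixed-schema extractor always emits the same columns, whereas a dynamic one infers context-specific columns from the query before filling rows.

```python
# Hypothetical sketch of dynamic schema generation for text-to-table.
# All names and data below are invented for illustration only.

FIXED_SCHEMA = ["name", "value"]  # conventional text-to-table: one schema for every query

def infer_schema(query: str) -> list[str]:
    """Toy schema inference: derive column headers from the query's comparison axes.

    e.g. "Compare methods by accuracy and latency" -> ["method", "accuracy", "latency"]
    """
    _, _, axes = query.partition(" by ")
    columns = [a.strip() for a in axes.replace(" and ", ",").split(",") if a.strip()]
    return ["method"] + columns

def extract_table(query: str, records: list[dict]) -> dict:
    """Build a table whose schema is tailored to the query, keeping only matching fields."""
    schema = infer_schema(query)
    rows = [[rec.get(col, "") for col in schema] for rec in records]
    return {"schema": schema, "rows": rows}

# Invented document-level records, as an upstream extraction step might produce them.
records = [
    {"method": "ModelA", "accuracy": 0.91, "latency": "120ms", "params": "7B"},
    {"method": "ModelB", "accuracy": 0.88, "latency": "95ms", "params": "13B"},
]

table = extract_table("Compare methods by accuracy and latency", records)
print(table["schema"])  # -> ['method', 'accuracy', 'latency']
print(table["rows"])    # -> [['ModelA', 0.91, '120ms'], ['ModelB', 0.88, '95ms']]
```

The point of the sketch is only the control flow: the schema is an output of the query, not a constant, so the same extractor yields different table layouts for different information needs.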