🤖 AI Summary
Current large language models (LLMs) struggle to accurately extract fragmented information from complex real-world documents (e.g., academic papers, reports) and dynamically construct structured tables, often producing disorganized, non-auditable paragraph-style outputs.
Method: We introduce AOE, the first bilingual benchmark for text-to-table conversion with dynamic schema generation, spanning three domains and 11 tasks and requiring models to adaptively infer context-sensitive table schemas from their inputs. AOE departs from conventional fixed-schema paradigms by incorporating inputs of varying length, traceable reasoning steps, and deep knowledge integration, supported by diverse human-crafted queries and gold-standard structured answers.
Contribution/Results: Extensive evaluation reveals significant performance gaps across state-of-the-art open- and closed-source LLMs on AOE, exposing fundamental weaknesses in structured reasoning and information organization. These findings highlight critical bottlenecks and provide concrete directions for advancing robust, schema-agnostic table generation capabilities.
📝 Abstract
With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schemas and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. In experiments, we evaluate both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggle significantly. The benchmark is available at https://huggingface.co/datasets/tianyumyum/AOE.
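To make the fixed-schema vs. dynamic-schema distinction concrete, here is a minimal, hypothetical sketch. The query, records, and schema-inference logic below are illustrative assumptions, not part of AOE or its evaluation pipeline: a fixed-schema extractor always emits the same columns, whereas a dynamic one infers context-specific columns from the query before filling rows.

```python
# Hypothetical sketch of dynamic schema generation for text-to-table.
# All names and data below are invented for illustration only.

FIXED_SCHEMA = ["name", "value"]  # conventional text-to-table: one schema for every query

def infer_schema(query: str) -> list[str]:
    """Toy schema inference: derive column headers from the query's comparison axes.

    e.g. "Compare methods by accuracy and latency" -> ["method", "accuracy", "latency"]
    """
    _, _, axes = query.partition(" by ")
    columns = [a.strip() for a in axes.replace(" and ", ",").split(",") if a.strip()]
    return ["method"] + columns

def extract_table(query: str, records: list[dict]) -> dict:
    """Build a table whose schema is tailored to the query, keeping only matching fields."""
    schema = infer_schema(query)
    rows = [[rec.get(col, "") for col in schema] for rec in records]
    return {"schema": schema, "rows": rows}

# Invented document-level records, as an upstream extraction step might produce them.
records = [
    {"method": "ModelA", "accuracy": 0.91, "latency": "120ms", "params": "7B"},
    {"method": "ModelB", "accuracy": 0.88, "latency": "95ms", "params": "13B"},
]

table = extract_table("Compare methods by accuracy and latency", records)
print(table["schema"])  # -> ['method', 'accuracy', 'latency']
print(table["rows"])    # -> [['ModelA', 0.91, '120ms'], ['ModelB', 0.88, '95ms']]
```

The point of the sketch is only the control flow: the schema is an output of the query, not a constant, so the same extractor yields different table layouts for different information needs.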