Unstructured Data Analysis using LLMs: A Comprehensive Benchmark

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unstructured data analysis (UDA) systems lack a unified, comprehensive benchmark, hindering fair comparison across heterogeneous architectures, including query interfaces, optimization strategies, and execution engines. Method: We propose UDA-Bench, the first holistic benchmark for UDA. It comprises five high-quality, cross-domain datasets with manually annotated relational views; LLM-assisted attribute extraction enhances annotation efficiency and coverage. UDA-Bench features diverse query workloads spanning multiple operators and complexity levels, enabling both end-to-end and component-level evaluation. Contribution/Results: Experiments demonstrate UDA-Bench's discriminative power over the core UDA modules (parsing, optimization, and execution) of state-of-the-art systems. It provides a reproducible, extensible, and standardized evaluation platform for UDA research, facilitating rigorous, architecture-agnostic performance assessment and systematic advancement of UDA methodologies.

📝 Abstract
Nowadays, the explosion of unstructured data presents immense analytical value. Leveraging the remarkable capability of large language models (LLMs) to extract attributes of structured tables from unstructured data, researchers are developing LLM-powered data systems that let users analyze unstructured documents as if they were working with a database. These unstructured data analysis (UDA) systems differ significantly in all aspects, including query interfaces, query optimization strategies, and operator implementations, making it unclear which performs best in which scenario. Unfortunately, no comprehensive benchmark exists that offers high-quality, large-volume, and diverse datasets as well as rich query workloads to thoroughly evaluate such systems. To fill this gap, we present UDA-Bench, the first benchmark for unstructured data analysis that meets all of the above requirements. Specifically, we organized a team of 30 graduate students that spent over 10,000 hours in total curating 5 datasets from various domains and constructing a relational database view of these datasets by manual annotation. These relational databases can serve as ground truth to evaluate any UDA system despite differences in programming interfaces. Moreover, we design diverse queries to analyze the attributes defined in the database schema, covering different types of analytical operators with varying selectivities and complexities. We conduct an in-depth analysis of the key building blocks of existing UDA systems: query interface, query optimization, operator design, and data processing. We run exhaustive experiments over the benchmark to fully evaluate these systems and the different techniques w.r.t. the above building blocks.
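To make the evaluation setup concrete: the abstract describes scoring a UDA system by comparing the relational view it extracts from documents against the manually annotated ground-truth database. Below is a minimal, hypothetical sketch of that idea; the function name, toy schema, and data are all illustrative assumptions, not UDA-Bench's actual API.

```python
# Hypothetical sketch: score an extracted relational view against a
# manually annotated ground-truth table by per-attribute exact-match
# accuracy, joining rows on a key column. All names/data are illustrative.

def attribute_accuracy(ground_truth, extracted, key):
    """Fraction of ground-truth rows whose attribute value was extracted correctly."""
    gt_by_key = {row[key]: row for row in ground_truth}
    attrs = [a for a in ground_truth[0] if a != key]
    correct = {a: 0 for a in attrs}
    for row in extracted:
        gt = gt_by_key.get(row[key])
        if gt is None:          # extracted row has no ground-truth match
            continue
        for a in attrs:
            if row.get(a) == gt[a]:
                correct[a] += 1
    n = len(ground_truth)
    return {a: correct[a] / n for a in attrs}

# Toy "papers" view with two annotated attributes.
truth = [
    {"doc_id": 1, "venue": "VLDB", "year": 2024},
    {"doc_id": 2, "venue": "SIGMOD", "year": 2023},
]
extracted = [
    {"doc_id": 1, "venue": "VLDB", "year": 2024},
    {"doc_id": 2, "venue": "SIGMOD", "year": 2022},  # one wrong cell
]
print(attribute_accuracy(truth, extracted, "doc_id"))
# → {'venue': 1.0, 'year': 0.5}
```

A real benchmark harness would additionally handle missing rows, fuzzy matching for free-text attributes, and query-level (not just cell-level) correctness, but the join-against-ground-truth pattern is the core of architecture-agnostic evaluation.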
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLM-powered unstructured data analysis systems
Evaluating diverse query interfaces and optimization strategies
Assessing performance across varied datasets and operator implementations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark with manually annotated relational database ground truth
Diverse query workload covering analytical operators
Evaluation of query interfaces, query optimization, and operator design
Authors
Qiyan Deng, Beijing Institute of Technology
Jianhui Li, Beijing Institute of Technology
Chengliang Chai, Beijing Institute of Technology (data cleaning and integration)
Jinqi Liu, Beijing Institute of Technology
Junzhi She, Beijing Institute of Technology
Kaisen Jin, Beijing Institute of Technology
Zhaoze Sun, Beijing Institute of Technology
Yuhao Deng, Beijing Institute of Technology
Jia Yuan, University of Macau
Ye Yuan, Beijing Institute of Technology
Guoren Wang, Beijing Institute of Technology
Lei Cao, Assistant Professor, University of Arizona / Research Scientist, MIT CSAIL (databases, machine learning)