🤖 AI Summary
Problem: Existing unstructured data analysis (UDA) systems lack a unified, comprehensive benchmark, which hinders fair comparison across systems that differ in query interfaces, optimization strategies, and execution engines.
Method: We propose UDA-Bench, the first holistic benchmark for UDA. It comprises five high-quality, cross-domain datasets with manually annotated relational views; LLM-assisted attribute extraction enhances annotation efficiency and coverage. UDA-Bench features diverse query workloads spanning multiple operators and complexity levels, enabling both end-to-end and component-level evaluation.
Contribution/Results: Experiments demonstrate UDA-Bench's discriminative power across the core UDA modules (parsing, optimization, and execution) of state-of-the-art systems. It provides a reproducible, extensible, and standardized evaluation platform for UDA research, facilitating rigorous, architecture-agnostic performance assessment and systematic advancement of UDA methodologies.
📝 Abstract
Nowadays, the explosion of unstructured data presents immense analytical value. Leveraging the remarkable capability of large language models (LLMs) to extract attributes of structured tables from unstructured data, researchers are developing LLM-powered data systems that let users analyze unstructured documents as if working with a database. These unstructured data analysis (UDA) systems differ significantly in all aspects, including query interfaces, query optimization strategies, and operator implementations, making it unclear which performs best in which scenario. Unfortunately, no comprehensive benchmark exists that offers high-quality, large-volume, and diverse datasets as well as rich query workloads to thoroughly evaluate such systems. To fill this gap, we present UDA-Bench, the first benchmark for unstructured data analysis that meets all the above requirements. Specifically, we organize a team of 30 graduate students that spent over 10,000 hours in total curating 5 datasets from various domains and constructing a relational database view from these datasets by manual annotation. These relational databases can be used as ground truth to evaluate any of these UDA systems despite their differences in programming interfaces. Moreover, we design diverse queries to analyze the attributes defined in the database schema, covering different types of analytical operators with varying selectivities and complexities. We conduct an in-depth analysis of the key building blocks of existing UDA systems: query interface, query optimization, operator design, and data processing. We run exhaustive experiments over the benchmark to fully evaluate these systems and different techniques with respect to the above building blocks.
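The core pattern the abstract describes can be sketched minimally: an extractor turns each unstructured document into a row of a relational view, which can then be queried like a database table. This is a hypothetical illustration, not code from UDA-Bench or any particular UDA system; a real system would use an LLM for extraction, so a trivial rule-based extractor stands in here, and the schema (company, founded, employees) is invented for the example.

```python
import re
from typing import Optional

# Toy "unstructured documents"; a real corpus would be far messier.
DOCS = [
    "Acme Corp was founded in 1999 and employs 250 people.",
    "Globex was founded in 2004 and employs 1200 people.",
]

def extract_row(doc: str) -> Optional[dict]:
    """Stand-in for LLM attribute extraction: map a document to a row
    matching the hypothetical schema (company, founded, employees)."""
    m = re.match(
        r"(?P<company>[\w ]+?) was founded in (?P<founded>\d{4}) "
        r"and employs (?P<employees>\d+) people\.",
        doc,
    )
    if m is None:
        return None  # extraction failed for this document
    return {
        "company": m["company"],
        "founded": int(m["founded"]),
        "employees": int(m["employees"]),
    }

# Build the relational view, then run a filter + projection "query" over
# it, exactly as one would over a database table.
view = [row for row in map(extract_row, DOCS) if row is not None]
recent = [r["company"] for r in view if r["founded"] >= 2000]
```

Because the extracted view is an ordinary table, a manually annotated version of the same view can serve as ground truth for scoring any UDA system's output, regardless of that system's query interface.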