AIDABench: AI Data Analytics Benchmark

📅 2026-02-27
🤖 AI Summary
This work proposes the first comprehensive benchmark for enterprise-grade, end-to-end data intelligence analysis, addressing the limitations of existing AI evaluation frameworks that often focus on isolated capabilities or oversimplified scenarios. The benchmark encompasses three core tasks—question answering, visualization, and report generation—and is constructed from over 600 real-world, cross-industry, heterogeneous document-based tasks, some of which require human experts 1–2 hours to complete. It integrates multimodal understanding, processing of both structured and unstructured data, and automated evaluation, ensuring reproducibility and extensibility. Experiments across 11 state-of-the-art models reveal that even the best-performing model achieves only a 59.43% pass-at-1 accuracy, highlighting significant gaps in current AI systems’ ability to handle complex, real-world data analysis tasks.
📝 Abstract
As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation standards has grown increasingly urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1-2 hours per question when assisted by AI tools, underscoring the benchmark's difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results reveal that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass-at-1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.
Problem

Research questions and friction points this paper addresses.

Keywords: AI benchmarking, document understanding, data analytics, end-to-end evaluation, real-world complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keywords: AI benchmarking, end-to-end evaluation, document understanding, data analytics, real-world complexity
👥 Authors
Yibo Yang (SenseTime Research)
Fei Lei (SenseTime Research)
Yixuan Sun (Fudan University)
Yantao Zeng (SenseTime Research)
Chengguang Lv (SenseTime Research)
Jiancao Hong (SenseTime Research)
Jiaojiao Tian (DLR)
Tianyu Qiu (SenseTime Research)
Xin Wang (SenseTime Research)
Yanbing Chen (SenseTime Research)
Yanjie Li (SenseTime Research)
Zheng Pan (SenseTime Research)
Xiaochen Zhou (SenseTime Research)
Guanzhou Chen (Shanghai Jiao Tong University; Shanghai AI Laboratory)
Haoran Lv (SenseTime Research)
Yuning Xu (SenseTime Research)
Yue Ou (SenseTime Research)
Haodong Liu (SenseTime Research)
Shiqi He (SenseTime Research)
Anya Jia (SenseTime Research)
Yulei Xin (SenseTime Research)
Huan Wu (ESSIC, University of Maryland / NASA GSFC)
Liang Liu (SenseTime Research)
Jiaye Ge (Shanghai AI Laboratory)
Jianxin Dong (Shanghai AI Laboratory)