FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current data agent evaluation faces three key challenges: (1) the absence of benchmarks covering diverse, multi-source heterogeneous analytical tasks; (2) the high cost and complexity of constructing high-quality test cases; and (3) the poor generalizability of existing benchmarks. To address these, we propose FDABench, the first benchmark for data agents performing integrated analysis over both structured and unstructured data, comprising 2,007 diverse query tasks. We design a standardized evaluation protocol and an “agent-expert” collaborative framework that combines multi-source data integration with human-in-the-loop test case construction to enable efficient, reliable, and comprehensive test generation. FDABench exhibits strong cross-system and cross-framework generalizability. Empirical evaluation of state-of-the-art data agent systems reveals significant performance disparities in response quality, accuracy, latency, and token consumption, demonstrating FDABench’s effectiveness and practical utility for rigorous, holistic agent assessment.
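
The summary names four measured dimensions (response quality, accuracy, latency, token consumption) but does not show how the evaluation protocol records them. Below is a minimal, hypothetical sketch of a per-task measurement loop; the agent interface (`agent.answer`) and every field name are assumptions for illustration, not FDABench's actual API.

```python
# Hypothetical per-task measurement loop; all names are illustrative,
# not taken from the FDABench implementation.
import time
from dataclasses import dataclass

@dataclass
class QueryResult:
    answer: str        # the agent's final response
    correct: bool      # naive match against the reference answer
    latency_s: float   # wall-clock time for the agent call
    tokens_used: int   # prompt + completion tokens reported by the agent

def run_one(agent, task) -> QueryResult:
    """Run a single benchmark task and record the metrics the paper reports."""
    start = time.perf_counter()
    answer, tokens = agent.answer(task)  # assumed interface: returns (text, token count)
    latency = time.perf_counter() - start
    return QueryResult(
        answer=answer,
        correct=(answer.strip() == task.reference_answer.strip()),
        latency_s=latency,
        tokens_used=tokens,
    )
```

Response quality is presumably judged separately (e.g., by graders); the exact-match check above is only a stand-in for the accuracy dimension.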

📝 Abstract
The growing demand for data-driven decision-making has created an urgent need for data agents that can integrate structured and unstructured data for analysis. While data agents show promise for enabling users to perform complex analytics tasks, this field still suffers from three critical limitations: first, comprehensive data agent benchmarks remain absent due to the difficulty of designing test cases that evaluate agents' abilities across multi-source analytical tasks; second, constructing reliable test cases that combine structured and unstructured data remains costly and prohibitively complex; third, existing benchmarks exhibit limited adaptability and generalizability, resulting in narrow evaluation scope. To address these challenges, we present FDABench, the first data agent benchmark specifically designed for evaluating agents in multi-source data analytical scenarios. Our contributions include: (i) we construct a standardized benchmark with 2,007 diverse tasks across different data sources, domains, difficulty levels, and task types to comprehensively evaluate data agent performance; (ii) we design an agent-expert collaboration framework ensuring reliable and efficient benchmark construction over heterogeneous data; (iii) we equip FDABench with robust generalization capabilities across diverse target systems and frameworks. We use FDABench to evaluate various data agent systems, revealing that each system exhibits distinct advantages and limitations regarding response quality, accuracy, latency, and token cost.
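
The abstract says tasks vary by data source, domain, difficulty level, and task type, but this page gives no schema. As a rough sketch under that description, one task record might look like the following; every field name here is a hypothetical choice, not the paper's format.

```python
# Hypothetical shape of one FDABench-style task record; field names are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkTask:
    task_id: str
    query: str                       # natural-language analytical question
    structured_sources: List[str]    # e.g. database tables the agent may query
    unstructured_sources: List[str]  # e.g. documents or reports to read
    domain: str                      # subject area the task is drawn from
    difficulty: str                  # e.g. "easy" / "medium" / "hard"
    task_type: str                   # task category within the 2,007-task suite
    reference_answer: str            # expert-validated ground truth
```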
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive benchmarks for multi-source data agents
High cost and complexity in reliable test case construction
Limited adaptability and generalizability in existing evaluation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized benchmark with 2,007 diverse tasks
Agent-expert collaboration framework for benchmark construction (see the sketch after this list)
Robust generalization across systems and frameworks
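
To make the collaboration idea concrete, here is a minimal sketch of an agent-expert construction loop, assuming an LLM drafts candidate test cases over the heterogeneous sources and a human expert accepts, edits, or rejects each one. All names are hypothetical; the paper's actual workflow may differ.

```python
# Hypothetical agent-expert (human-in-the-loop) construction loop.
def build_benchmark(drafting_agent, expert, raw_sources, target_size):
    accepted = []
    while len(accepted) < target_size:
        candidate = drafting_agent.draft_task(raw_sources)  # agent proposes a task
        verdict = expert.review(candidate)                  # human validates it
        if verdict.approved:
            # keep the expert-revised version when the expert edited the draft
            accepted.append(verdict.revised or candidate)
    return accepted
```

This split matches the paper's stated goal: the agent supplies scale and coverage, while expert review keeps the resulting test cases reliable.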
Ziting Wang
College of Computing and Data Science, Nanyang Technological University
Shize Zhang
National University of Singapore
Haitao Yuan
New Jersey Institute of Technology, NJ, USA, and Beihang University, Beijing, China
Deep Learning, Data-driven Optimization, Computational Intelligence, Intelligent Decisions, IoTs
Jinwei Zhu
Huawei Technologies Co., Ltd
Shifu Li
Huawei Technologies Co., Ltd
Wei Dong
College of Computing and Data Science, Nanyang Technological University
Gao Cong
Nanyang Technological University
Data Management, Databases, Data Mining, Spatial Databases