🤖 AI Summary
Current data agent evaluation faces three key challenges: (1) the absence of benchmarks covering diverse, multi-source analytical tasks over heterogeneous data; (2) the high cost and complexity of constructing high-quality test cases; and (3) the poor generalizability of existing benchmarks. To address these, we propose FDABench, the first benchmark for data agents performing integrated analysis over both structured and unstructured data, comprising 2,007 diverse query tasks. We design a standardized evaluation protocol and an "agent-expert" collaborative framework that combines multi-source data integration with human-in-the-loop test case construction, enabling efficient, reliable, and comprehensive test generation. FDABench generalizes across target systems and agent frameworks. Empirical evaluation of state-of-the-art data agent systems reveals significant performance disparities in response quality, accuracy, latency, and token consumption, demonstrating FDABench's utility for rigorous, holistic agent assessment.
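The "agent-expert" construction pipeline is described only at a high level here, and the paper's actual implementation is not reproduced. As a rough illustration of what such a human-in-the-loop loop could look like, consider the Python sketch below; `TestCase`, `Verdict`, and the draft/review/revise callables are all hypothetical names introduced for this example, not FDABench's API.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """Candidate multi-source benchmark task (hypothetical schema)."""
    question: str
    structured_sources: list[str]    # e.g. database tables the task touches
    unstructured_sources: list[str]  # e.g. documents or reports it draws on
    reference_answer: str
    validated: bool = False

@dataclass
class Verdict:
    """A human expert's decision on a candidate case."""
    ok: bool
    feedback: str = ""

def build_cases(topics, draft_fn, review_fn, revise_fn, max_rounds=3):
    """Agent-expert loop: an agent drafts a candidate case, a human
    expert accepts it or returns feedback, and the agent revises.
    All callables are placeholders, not FDABench functions."""
    accepted = []
    for topic in topics:
        case = draft_fn(topic)            # agent proposes a candidate task
        for _ in range(max_rounds):
            verdict = review_fn(case)     # human-in-the-loop validation
            if verdict.ok:
                case.validated = True
                accepted.append(case)
                break
            case = revise_fn(case, verdict.feedback)  # agent revises
    return accepted
```

The point of the structure is that the expensive resource (expert time) is spent only on accept/reject decisions and short feedback, while the agent does the drafting and revision, which is one plausible reading of why the summary calls the construction "efficient" and "reliable".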
📝 Abstract
The growing demand for data-driven decision-making has created an urgent need for data agents that can integrate structured and unstructured data for analysis. While data agents show promise for enabling users to perform complex analytics tasks, the field still suffers from three critical limitations: first, comprehensive data agent benchmarks remain absent because it is difficult to design test cases that evaluate agents' abilities across multi-source analytical tasks; second, constructing reliable test cases that combine structured and unstructured data remains costly and prohibitively complex; third, existing benchmarks exhibit limited adaptability and generalizability, resulting in a narrow evaluation scope.
To address these challenges, we present FDABench, the first data agent benchmark specifically designed for evaluating agents in multi-source data analytical scenarios. Our contributions include: (i) we construct a standardized benchmark with 2,007 diverse tasks spanning different data sources, domains, difficulty levels, and task types to comprehensively evaluate data agent performance; (ii) we design an agent-expert collaboration framework that ensures reliable and efficient benchmark construction over heterogeneous data; (iii) we equip FDABench to generalize across diverse target systems and agent frameworks. We use FDABench to evaluate various data agent systems, revealing that each system exhibits distinct advantages and limitations in response quality, accuracy, latency, and token cost.
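The four evaluation dimensions named above (response quality, accuracy, latency, token cost) can be pictured with a minimal measurement harness like the sketch below. Everything in it is an assumption made for illustration: `agent_fn`, `judge_fn`, and the task dictionary keys are hypothetical stand-ins, not FDABench's actual evaluation protocol.

```python
import time
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Per-task record of the metrics the abstract enumerates."""
    task_id: str
    correct: bool
    latency_s: float
    tokens_used: int

def evaluate(agent_fn, tasks, judge_fn):
    """Run a data agent over benchmark tasks and aggregate accuracy,
    latency, and token consumption. agent_fn takes a question and
    returns (answer, tokens_used); judge_fn scores an answer against
    the reference. Both are placeholders, not FDABench APIs."""
    results = []
    for task in tasks:
        start = time.perf_counter()
        answer, tokens = agent_fn(task["question"])   # call the agent under test
        latency = time.perf_counter() - start
        results.append(EvalResult(
            task_id=task["id"],
            correct=judge_fn(answer, task["reference_answer"]),
            latency_s=latency,
            tokens_used=tokens,
        ))
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "avg_latency_s": sum(r.latency_s for r in results) / n,
        "avg_tokens": sum(r.tokens_used for r in results) / n,
    }
```

A harness of this shape makes the reported trade-offs concrete: two agents can reach similar accuracy while differing sharply in average latency or tokens per task, which is the kind of disparity the evaluation highlights.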