🤖 AI Summary
Current data agent evaluation faces three key challenges: (1) the absence of benchmarks covering diverse, multi-source analytical tasks over heterogeneous data; (2) the high cost and complexity of constructing high-quality test cases; and (3) the poor generalizability of existing benchmarks. To address these, we propose FDABench, the first benchmark for data agents performing integrated analysis over both structured and unstructured data, comprising 2,007 diverse query tasks. We design a standardized evaluation protocol and an "agent-expert" collaborative framework that combines multi-source data integration with human-in-the-loop test case construction, enabling efficient, reliable, and comprehensive test generation. FDABench generalizes across target systems and agent frameworks. Empirical evaluation of state-of-the-art data agent systems reveals significant performance disparities in response quality, accuracy, latency, and token consumption, demonstrating FDABench's utility for rigorous, holistic agent assessment.
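The "agent-expert" construction pipeline is described only at a high level here, and the paper's actual implementation is not reproduced. As a rough illustration of what such a human-in-the-loop loop could look like, consider the Python sketch below; `TestCase`, `Verdict`, and the draft/review/revise callables are all hypothetical names introduced for this example, not FDABench's API.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """Candidate multi-source benchmark task (hypothetical schema)."""
    question: str
    structured_sources: list[str]    # e.g. database tables the task touches
    unstructured_sources: list[str]  # e.g. documents or reports it draws on
    reference_answer: str
    validated: bool = False

@dataclass
class Verdict:
    """A human expert's decision on a candidate case."""
    ok: bool
    feedback: str = ""

def build_cases(topics, draft_fn, review_fn, revise_fn, max_rounds=3):
    """Agent-expert loop: an agent drafts a candidate case, a human
    expert accepts it or returns feedback, and the agent revises.
    All callables are placeholders, not FDABench functions."""
    accepted = []
    for topic in topics:
        case = draft_fn(topic)            # agent proposes a candidate task
        for _ in range(max_rounds):
            verdict = review_fn(case)     # human-in-the-loop validation
            if verdict.ok:
                case.validated = True
                accepted.append(case)
                break
            case = revise_fn(case, verdict.feedback)  # agent revises
    return accepted
```

The point of the structure is that the expensive resource (expert time) is spent only on accept/reject decisions and short feedback, while the agent does the drafting and revision, which is one plausible reading of why the summary calls the construction "efficient" and "reliable".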
📝 Abstract
The growing demand for data-driven decision-making has created an urgent need for data agents that can integrate structured and unstructured data for analysis. While data agents show promise for enabling users to perform complex analytics tasks, the field still suffers from three critical limitations: first, comprehensive data agent benchmarks remain absent because it is difficult to design test cases that evaluate agents' abilities across multi-source analytical tasks; second, constructing reliable test cases that combine structured and unstructured data remains costly and prohibitively complex; third, existing benchmarks exhibit limited adaptability and generalizability, resulting in a narrow evaluation scope.
To address these challenges, we present FDABench, the first data agent benchmark specifically designed for evaluating agents in multi-source data analytical scenarios. Our contributions include: (i) we construct a standardized benchmark with 2,007 diverse tasks spanning different data sources, domains, difficulty levels, and task types to comprehensively evaluate data agent performance; (ii) we design an agent-expert collaboration framework that ensures reliable and efficient benchmark construction over heterogeneous data; (iii) we equip FDABench to generalize across diverse target systems and agent frameworks. We use FDABench to evaluate various data agent systems, revealing that each system exhibits distinct advantages and limitations in response quality, accuracy, latency, and token cost.
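The four evaluation dimensions named above (response quality, accuracy, latency, token cost) can be pictured with a minimal measurement harness like the sketch below. Everything in it is an assumption made for illustration: `agent_fn`, `judge_fn`, and the task dictionary keys are hypothetical stand-ins, not FDABench's actual evaluation protocol.

```python
import time
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Per-task record of the metrics the abstract enumerates."""
    task_id: str
    correct: bool
    latency_s: float
    tokens_used: int

def evaluate(agent_fn, tasks, judge_fn):
    """Run a data agent over benchmark tasks and aggregate accuracy,
    latency, and token consumption. agent_fn takes a question and
    returns (answer, tokens_used); judge_fn scores an answer against
    the reference. Both are placeholders, not FDABench APIs."""
    results = []
    for task in tasks:
        start = time.perf_counter()
        answer, tokens = agent_fn(task["question"])   # call the agent under test
        latency = time.perf_counter() - start
        results.append(EvalResult(
            task_id=task["id"],
            correct=judge_fn(answer, task["reference_answer"]),
            latency_s=latency,
            tokens_used=tokens,
        ))
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "avg_latency_s": sum(r.latency_s for r in results) / n,
        "avg_tokens": sum(r.tokens_used for r in results) / n,
    }
```

A harness of this shape makes the reported trade-offs concrete: two agents can reach similar accuracy while differing sharply in average latency or tokens per task, which is the kind of disparity the evaluation highlights.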