DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the lack of rigorous evaluation and the subpar performance of data agents across the enterprise data intelligence lifecycle, spanning data engineering and data analysis. We introduce DAComp, the first end-to-end benchmark grounded in real-world industrial scenarios. DAComp comprises 210 tasks that systematically cover the full workflow from raw data processing to decision-insight generation and, crucially, explicitly decouple data engineering capabilities (e.g., multi-stage SQL pipeline construction and evolution) from data analysis capabilities (e.g., open-ended business question answering and actionable recommendation generation). For evaluation, we adopt execution-based, multi-metric assessment for data engineering tasks and experimentally validated LLM-based adjudication with hierarchical scoring rules for analysis tasks. Empirical results reveal that state-of-the-art data agents achieve success rates below 20% on data engineering tasks and average scores below 40% on analysis tasks, exposing fundamental limitations in process orchestration and strategic reasoning.

📝 Abstract
Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and modifying existing systems as requirements evolve. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM judge guided by meticulously crafted hierarchical rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration rather than mere code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
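As a rough illustration of what "execution-based" evaluation of a DE task can mean, the sketch below runs the agent's pipeline SQL and a reference SQL against the same database and checks that the resulting tables match (order-insensitive). This is a minimal, hypothetical harness, not DAComp's actual evaluator; all function names here are illustrative.

```python
import sqlite3

def run_query(conn, sql):
    """Execute a query and return its rows sorted, for order-insensitive comparison."""
    return sorted(conn.execute(sql).fetchall())

def execution_match(conn, agent_sql, reference_sql):
    """Score 1.0 if the agent's output table equals the reference output, else 0.0."""
    return 1.0 if run_query(conn, agent_sql) == run_query(conn, reference_sql) else 0.0
```

A real harness would also compare intermediate pipeline stages and aggregate several such metrics, but the core idea is judging outputs by execution rather than by comparing SQL text.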
Problem

Research questions and friction points this paper is trying to address.

Benchmarking data agents across full data intelligence lifecycle
Evaluating performance on data engineering and analysis tasks
Diagnosing limitations in pipeline orchestration and open-ended reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking data engineering and analysis tasks
Using execution-based and LLM-judge evaluation methods
Diagnosing limitations in pipeline orchestration and reasoning
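The hierarchical rubric scoring mentioned above can be pictured as a tree of weighted criteria: an LLM judge assigns 0-1 scores to leaf criteria, and parent nodes aggregate their children by weighted average. The sketch below is purely illustrative; the rubric structure, criterion names, and weights are assumptions, not DAComp's actual rubrics.

```python
def score_node(node, judgments):
    """Recursively score a rubric node; leaves look up their judged scores."""
    children = node.get("children")
    if not children:
        return judgments[node["name"]]
    total_weight = sum(c["weight"] for c in children)
    return sum(c["weight"] * score_node(c, judgments) for c in children) / total_weight

# Illustrative two-level rubric with weighted criteria (hypothetical).
rubric = {
    "name": "DA task",
    "children": [
        {"name": "analysis", "weight": 2, "children": [
            {"name": "correct_statistics", "weight": 1},
            {"name": "sound_methodology", "weight": 1},
        ]},
        {"name": "actionable_recommendations", "weight": 1},
    ],
}
```

Weighting intermediate nodes (rather than flat-averaging leaves) lets the rubric emphasize, say, analytical soundness over presentation without rewriting leaf criteria.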
Fangyu Lei
Institute of Automation, Chinese Academy of Sciences
LLM Agent, Code Generation, Text-to-SQL, Table Reasoning
Jinxiang Meng
Nanjing University of Posts and Telecommunications
LLM Agent, Table Reasoning, Tool Use
Yiming Huang
UC San Diego
Junjie Zhao
Master's student, Peking University
CVML
Yitong Zhang
NUS
Jianwen Luo
Institute of Automation, CAS; University of Chinese Academy of Sciences
Xin Zou
ByteDance Seed
Ruiyi Yang
ByteDance Seed
Wenbo Shi
Harvard University, Singularity Energy, Inc.
Smart Grid, Energy/Carbon Management, Machine Learning, Optimization, Distributed Algorithms
Yan Gao
ByteDance Seed
Shizhu He
Institute of Automation, CAS; University of Chinese Academy of Sciences
Zuo Wang
ByteDance Seed
Qian Liu
TikTok
Yang Wang
ByteDance Seed
Ke Wang
ByteDance Seed
Jun Zhao
Institute of Automation, CAS; University of Chinese Academy of Sciences
Kang Liu
Institute of Automation, CAS; University of Chinese Academy of Sciences