🤖 AI Summary
Existing benchmarks inadequately assess AI agents’ ability to integrate, transform, and analyze data across heterogeneous databases to answer natural language questions in real-world enterprise settings. To address this gap, this work proposes DAB, the first end-to-end evaluation benchmark for data agents in multi-source, heterogeneous environments. DAB comprises 54 queries over 12 datasets spanning 9 domains and 4 database systems, combining structured and unstructured data, with its task suite grounded in empirical use cases from six industries. Experimental results reveal that even the strongest current model, Gemini-3-Pro, achieves only 38% pass@1 accuracy on DAB, highlighting significant limitations of existing AI agents in complex, real-world data scenarios and underscoring the benchmark’s difficulty and necessity.
📝 Abstract
Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous database systems, with inconsistent references and information buried in unstructured text. Existing benchmarks tackle only individual pieces of this problem -- e.g., translating natural-language questions into SQL queries, or answering questions over small tables provided in context -- but do not evaluate the full pipeline of integrating, transforming, and analyzing data across multiple database systems. To fill this gap, we present the Data Agent Benchmark (DAB), grounded in a formative study of enterprise data agent workloads across six industries. DAB comprises 54 queries across 12 datasets, 9 domains, and 4 database management systems. On DAB, the best frontier model (Gemini-3-Pro) achieves only 38% pass@1 accuracy. We benchmark five frontier LLMs, analyze their failure modes, and distill takeaways for future data agent development. Our benchmark and experiment code are available at github.com/ucbepic/DataAgentBench.