A Survey of Data Agents: Emerging Paradigm or Overstated Hype?

📅 2025-10-27
🤖 AI Summary
Problem: The term “data agent” is defined ambiguously and used inconsistently, which blurs the distinction between simple query systems and sophisticated autonomous agents and leads to misaligned user expectations, unclear accountability, and a lack of industry standards. Method: We propose the first six-level autonomy taxonomy (L0–L5) for data agents, inspired by SAE J3016 for autonomous vehicles, defining capability boundaries across goal formulation, decision execution, exception handling, and responsibility attribution. Building on LLMs and automated data management, we construct an evaluable, extensible taxonomy and identify the L2→L3 transition as a critical inflection point. Contribution/Results: We derive a technology roadmap toward generative, fully autonomous data agents. The framework provides a standardized foundation for the Data+AI ecosystem, supporting consistent terminology, systematic evaluation, and responsible innovation.

📝 Abstract
The rapid advancement of large language models (LLMs) has spurred the emergence of data agents: autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" is currently used ambiguously and inconsistently, conflating simple query responders with sophisticated autonomous architectures. This ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth. Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation. Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
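The six-level scale described above could be represented as an ordered enumeration. The following is a minimal sketch, not the paper's own formalization: the names `AutonomyLevel`, `DESCRIPTIONS`, and `crosses_l2_l3_boundary` are hypothetical, and the descriptions for L2 and L3 assume the abstract's phrase "from procedural execution to autonomous orchestration" maps onto those two levels. Levels the abstract does not describe are left as `None` rather than filled in.

```python
from enum import IntEnum
from typing import Dict, Optional


class AutonomyLevel(IntEnum):
    """Six autonomy levels, L0 through L5 (labels only; hypothetical class name)."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


# Descriptions quoted from the abstract; None marks levels it does not describe.
DESCRIPTIONS: Dict[AutonomyLevel, Optional[str]] = {
    AutonomyLevel.L0: "manual operations",
    AutonomyLevel.L1: None,
    AutonomyLevel.L2: "procedural execution",       # assumed reading of the abstract
    AutonomyLevel.L3: "autonomous orchestration",   # assumed reading of the abstract
    AutonomyLevel.L4: None,
    AutonomyLevel.L5: "generative, fully autonomous data agents",
}


def crosses_l2_l3_boundary(current: AutonomyLevel, target: AutonomyLevel) -> bool:
    """Return True if moving from `current` up to `target` passes the L2-to-L3 transition."""
    return current <= AutonomyLevel.L2 and target >= AutonomyLevel.L3
```

Using an ordered integer enum keeps level comparisons direct, so a check such as `crosses_l2_l3_boundary(AutonomyLevel.L1, AutonomyLevel.L4)` reduces to two comparisons.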
Problem

Research questions and friction points this paper is trying to address.

Addressing terminological ambiguity in data agent definitions
Establishing hierarchical taxonomy for data agent autonomy levels
Analyzing evolutionary gaps in autonomous data orchestration systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical taxonomy delineates data agent autonomy levels
Systematic review organizes research by increasing autonomy
Roadmap envisions proactive generative data agents
Yizhang Zhu
The Hong Kong University of Science and Technology (Guangzhou)
AI for Data Analytics, AI for DB, Data-centric AI
Liangwei Wang
HKUST(GZ)
Information Visualization, Human-Computer Interaction
Chenyu Yang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Xiaotian Lin
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Boyan Li
The Hong Kong University of Science and Technology (Guangzhou)
Databases, Natural Language to SQL
Wei Zhou
Shanghai Jiao Tong University, Shanghai, China
Xinyu Liu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Zhangyang Peng
Master Student, Hangzhou Dianzi University
Vector Database, Approximate Nearest Neighbor Search, Community Search
Tianqi Luo
Research assistant, University of Minnesota, Twin Cities
semiconductor physics, optics
Yu Li
Renmin University of China, Beijing, China
Chengliang Chai
Beijing Institute of Technology
Data cleaning and integration
Chong Chen
Huawei
Shimin Di
Southeast University, Nanjing, China
Ju Fan
Renmin University of China
Database, Crowdsourcing, Influence Maximization, Data Integration
Ji Sun
Huawei
database
Nan Tang
National Institute of Biological Sciences, Beijing
stem cell biology, aging, lung diseases
Fugee Tsung
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Jiannan Wang
Tsinghua University, Beijing, China
Chenglin Wu
Founder & CEO, DeepWisdom
Foundation Agents, Artificial Intelligence, AutoML
Yanwei Xu
Huawei
Shaolei Zhang
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
Natural Language Processing, Large Language Model, Multimodal LLMs, Simultaneous Translation
Yong Zhang
Tsinghua University, Beijing, China
Xuanhe Zhou
Assistant Professor, Shanghai Jiao Tong University
Data Management, Artificial Intelligence
Guoliang Li
Professor, Tsinghua University
Database, Big Data, Crowdsourcing, Data Cleaning & Integration
Yuyu Luo
Assistant Professor, HKUST(GZ) / HKUST
Data Agents, LLM Agents, Database, Text-to-SQL, Data-centric AI