A Survey on Large Language Model-based Agents for Statistics and Data Science

📅 2024-12-18
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Problem: Non-expert users face significant barriers in performing complex data analysis. Method: This work systematically investigates LLM-driven data science agents (“data agents”) as a paradigm shift and introduces the first taxonomy of LLM agents for statistics and data science, characterizing seven core capabilities, including knowledge integration and human-agent interaction. We propose a unified architecture integrating prompt engineering, chain-of-thought reasoning, reflection mechanisms, multi-agent collaboration, and domain-specific knowledge injection, coupled with a visual interactive interface. Contribution/Results: Through analysis of representative academic and industrial case studies, we identify five fundamental challenges: scalability, statistical rigor, causal inference, interpretability, and robustness. We further articulate, for the first time, an evolutionary pathway toward autonomous intelligent statistical software, establishing both theoretical foundations and a concrete technical roadmap for next-generation intelligent analytics tools.

📝 Abstract
In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.
Problem

Research questions and friction points this paper is trying to address.

LLM-based agents simplify complex data tasks
Lower entry barriers for non-expert data users
Enable autonomous data-centric problem solving
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based agents simplify complex data tasks
Frameworks enable minimal human intervention in data problems
Multi-agent collaboration and reasoning enhance data analysis
Maojun Sun
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University
Ruijian Han
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University
Binyan Jiang
The Hong Kong Polytechnic University
Statistics
Houduo Qi
Professor, DSAI and AMA, The Hong Kong Polytechnic University
Mathematical Optimization, Operations Research
Defeng Sun
Department of Applied Mathematics, The Hong Kong Polytechnic University
Yancheng Yuan
Assistant Professor, The Hong Kong Polytechnic University
Optimization Algorithms, Machine Learning
Jian Huang
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University; Department of Applied Mathematics, The Hong Kong Polytechnic University