Scaling Generalist Data-Analytic Agents

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-source data agents perform poorly on multi-format, large-scale, and long-horizon multi-step data analysis tasks, primarily due to data scarcity and suboptimal training strategies. This paper introduces DataMind, a scalable data synthesis and agent training recipe for building generalist data-analytic agents. DataMind designs a fine-grained task taxonomy with a recursive easy-to-hard task composition mechanism; proposes knowledge-augmented trajectory sampling with model-based and rule-based filtering, plus a dynamically adjustable SFT+RL hybrid training objective; and implements a memory-frugal, stable multi-turn code execution framework. Trained on the self-constructed high-quality synthetic dataset DataMind-12K, the resulting DataMind-14B and DataMind-7B models achieve average scores of 71.16% and 68.10%, respectively, across multiple data analysis benchmarks, with DataMind-14B outperforming DeepSeek-V3.1 and GPT-5 and both models establishing new state-of-the-art results among open-source models. All datasets, models, and code will be publicly released.

📝 Abstract
Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle with the diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; and 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.
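The abstract does not spell out the form of the dynamically adjustable SFT+RL objective. A common way to combine the two signals is a weighted sum whose mixing coefficient is annealed over training, shifting from imitation toward reward optimization. The sketch below assumes a simple linear schedule; the function name and schedule are illustrative, not the paper's implementation:

```python
def combined_loss(sft_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Hypothetical dynamically weighted SFT+RL objective.

    alpha decays linearly from 1.0 (pure SFT) to 0.0 (pure RL),
    so the training signal moves from imitation to reward over time.
    """
    alpha = max(0.0, 1.0 - step / total_steps)  # assumed linear annealing schedule
    return alpha * sft_loss + (1.0 - alpha) * rl_loss

# Early in training the SFT term dominates; late in training, the RL term does.
early = combined_loss(sft_loss=2.0, rl_loss=0.5, step=0, total_steps=100)
late = combined_loss(sft_loss=2.0, rl_loss=0.5, step=100, total_steps=100)
```

Other schedules (cosine decay, performance-gated switching) fit the same interface; only the `alpha` line changes.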
Problem

Research questions and friction points this paper is trying to address.

Open-source models struggle with diverse-format large-scale data files
Current approaches lack proper training strategies for data-analytic agents
Existing methods face unstable code-based multi-turn reasoning challenges
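The paper's own rollout framework is not detailed on this page, but a common way to make code-based multi-turn rollout stable and memory-frugal is to execute each code turn in a fresh subprocess with a hard timeout, so a hanging or memory-leaking turn cannot take down the trainer. A minimal sketch under those assumptions (the function name is illustrative, not from the paper):

```python
import subprocess
import sys

def run_code_turn(code: str, timeout_s: float = 5.0) -> str:
    """Execute one agent code turn in an isolated subprocess.

    A fresh interpreter per turn caps memory growth across turns;
    the timeout guards against infinite loops during rollout.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.stdout if result.returncode == 0 else f"ERROR: {result.stderr}"
    except subprocess.TimeoutExpired:
        return "ERROR: execution timed out"

# A well-behaved turn returns its stdout; a runaway turn is killed cleanly.
ok = run_code_turn("print(1 + 1)")
bad = run_code_turn("while True: pass", timeout_s=1.0)
```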
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained task taxonomy with recursive composition
Knowledge-augmented trajectory sampling with filtering
Dynamic training objective combining SFT and RL
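As a toy illustration of recursive easy-to-hard task composition (the paper's actual mechanism is more elaborate; the query templates here are invented for the sketch), atomic queries can be chained so that each later step must be answered using the result of the step before it, turning several easy tasks into one long-horizon task:

```python
def compose_task(atomic_queries: list[str]) -> str:
    """Chain atomic queries into one multi-step task: each later step
    depends on the result of the previous one."""
    steps = [atomic_queries[0]]
    for q in atomic_queries[1:]:
        steps.append(f"Using the previous result, {q}")
    return " Then, ".join(steps)

# Hypothetical atomic queries of increasing difficulty.
atoms = [
    "compute the mean revenue per region",
    "rank regions by that mean",
    "report the top-3 regions and their share",
]
task = compose_task(atoms)
```

Applied recursively (composed tasks fed back in as atoms), the same idea yields progressively harder multi-step queries.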
👥 Authors
Shuofei Qiao — Zhejiang University
Yanqiu Zhao — Zhejiang University
Zhisong Qiu — Zhejiang University
Xiaobin Wang — Alibaba Group
Jintian Zhang — Zhejiang University
Zhao Bin — Zhejiang University
Ningyu Zhang — Ph.D. Student, Vanderbilt University
Yong Jiang — Alibaba Group
Pengjun Xie — Alibaba Group
Fei Huang — Alibaba Group
Huajun Chen — Zhejiang University