AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

πŸ“… 2026-03-19
πŸ€– AI Summary
This study systematically evaluates the performance gap between AI agents and human experts on domain-specific data science tasks and explores the potential of human-AI collaboration. To this end, the authors introduce AgentDS, a benchmark comprising 17 tasks across six industries, and organize a competition in which 29 teams applied large language model–based agents to real-world multi-industry datasets under a standardized evaluation framework. The results demonstrate that purely AI-driven approaches generally perform at or below the median level of human participants, whereas the top-performing solutions consistently emerge from human-AI collaboration. This work provides quantitative evidence of the irreplaceable role of human experts in domain-specific reasoning and establishes a new paradigm: human-AI synergy outperforms purely autonomous AI systems.

πŸ“ Abstract
Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning: AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. The AgentDS website is available at https://agentds.org/ and the open-source datasets at https://huggingface.co/datasets/lainmn/AgentDS.
Problem

Research questions and friction points this paper addresses.

human-AI collaboration
domain-specific data science
AI agent benchmarking
large language models
expertise advantage
Innovation

Methods, ideas, or system contributions that make the work stand out.

human-AI collaboration
domain-specific data science
AI benchmarking
LLM agents
expertise augmentation