AutoData: A Multi-Agent System for Open Web Data Collection

2025-05-21
Citations: 0
Influential: 0
AI Summary
Existing web data collection methods face two challenges, high labor cost and poor scalability: wrapper-based approaches adapt poorly and are hard to reproduce, while LLM-driven solutions incur substantial computational and financial overhead. This paper introduces AutoData, a multi-agent system that collects structured data from the open web given only a natural-language instruction describing the desired dataset. Its core contributions are: (1) an oriented message hypergraph, coordinated by a central task manager, that organizes agents across research and development squads; (2) a hypergraph cache system that makes agent collaboration more efficient and reduces token cost; and (3) Instruct2DS, a new benchmark supporting live data collection across three domains (academic, finance, and sports). Experiments show that AutoData outperforms baseline methods on Instruct2DS and three existing benchmarks, and case studies on picture book collection and paper extraction from surveys further support its applicability. The code and dataset are publicly released.

๐Ÿ“ Abstract
The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection, that requires minimal human intervention, i.e., only necessitating a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi-agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Besides, we introduce a novel hypergraph cache system to advance the multi-agent collaboration process that enables efficient automated data collection and mitigates the token cost issues prevalent in existing LLM-based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at https://github.com/GraphResearcher/AutoData.
Problem

Research questions and friction points this paper is trying to address.

Demand for high-quality web-sourced datasets exceeds what current collection methods can supply.
Existing approaches lack adaptability, scalability, and cost-efficiency.
AutoData automates web data collection with minimal human intervention.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system for automated web data collection
Oriented message hypergraph coordinated by task manager
Hypergraph cache system to reduce token costs
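
The idea of an oriented message hypergraph can be illustrated with a small sketch. This is not AutoData's implementation: the class names, the agent names (`manager`, `browser`, `planner`, `coder`), and the word-count proxy for token cost are all hypothetical, chosen only to show how addressing each message to a specific set of receivers can shrink the context an agent has to read compared with broadcasting every message to every agent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HyperEdge:
    """A directed hyperedge: one sender, one message, a set of receivers."""
    sender: str
    receivers: frozenset
    message: str


@dataclass
class MessageHypergraph:
    """Hypothetical message store; not the structure used in the paper."""
    edges: list = field(default_factory=list)

    def send(self, sender, receivers, message):
        # Record a message addressed only to the listed receivers.
        self.edges.append(HyperEdge(sender, frozenset(receivers), message))

    def inbox(self, agent):
        # Messages whose hyperedge points at this agent.
        return [e.message for e in self.edges if agent in e.receivers]

    def broadcast_history(self):
        # What every agent would read if all messages were shared.
        return [e.message for e in self.edges]


def word_count(messages):
    """Crude stand-in for a token count (assumption: words ~ tokens)."""
    return sum(len(m.split()) for m in messages)


graph = MessageHypergraph()
graph.send("manager", {"browser", "planner"}, "collect papers cited in the survey")
graph.send("browser", {"planner"}, "the page lists references in an HTML table")
graph.send("planner", {"coder"}, "parse each table row into title and year")

coder_context = graph.inbox("coder")      # only the message addressed to coder
full_context = graph.broadcast_history()  # all three messages

print(word_count(coder_context), "words vs", word_count(full_context), "words")
```

In this toy setup the `coder` agent reads one message instead of three. How AutoData's task manager actually routes messages, and how its hypergraph cache stores them, is described in the paper rather than here.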
Authors

Tianyi Ma
University of Notre Dame

Yiyue Qian
Amazon
graph representation learning, LLM, multi-modal learning

Zheyuan Zhang
University of Notre Dame

Zehong Wang
University of Notre Dame

Xiaoye Qian
Amazon

Feifan Bai
University of Washington

Yifan Ding
University of Notre Dame

Xuwei Luo
Purdue University

Shinan Zhang
Amazon

Keerthiram Murugesan
Research Scientist, IBM Research AI / Carnegie Mellon University
Machine Learning, Artificial Intelligence

Chuxu Zhang
Associate Professor of CSE, University of Connecticut (UConn)
Machine Learning, Deep Learning, Data Mining

Yanfang Ye
University of Notre Dame