AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

📅 2025-11-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper proposes a collective-perception-enhanced retrieval framework to address two weaknesses in AI experiment design: insufficient coverage of datasets and baselines, and an overreliance on textual similarity at the expense of experimental suitability. First, it constructs a large-scale academic citation network from top-tier conference papers (2019–2023), integrating citation contexts and paper self-descriptions to produce rich semantic representations. Second, it fine-tunes an embedding model for efficient initial retrieval and adds an LLM-driven, reasoning-enhanced re-ranking module that generates interpretable justifications for its recommendations. The curated dataset covers 85% of the baselines and datasets commonly used at ACL, NeurIPS, ICML, and other leading venues. Experiments demonstrate significant improvements: Recall@20 increases by 5.85 percentage points and HitRate@5 by 8.30 percentage points, markedly enhancing both the comprehensiveness and the experimental applicability of recommendations.
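The summary reports gains in Recall@20 and HitRate@5. For readers unfamiliar with these retrieval metrics, a minimal sketch (the item names below are illustrative, not from the paper's data):

```python
# Recall@K: fraction of ground-truth items recovered in the top-K results.
# HitRate@K: 1.0 if at least one ground-truth item appears in the top-K.

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of `ranked`."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(relevant) else 0.0

# Toy example: a ranked candidate list vs. the datasets a paper actually used.
ranked = ["SQuAD", "GLUE", "MNLI", "ImageNet", "CIFAR-10"]
relevant = {"GLUE", "CIFAR-10", "WikiText"}
print(recall_at_k(ranked, relevant, 5))    # 2 of the 3 relevant items found
print(hit_rate_at_k(ranked, relevant, 5))
```

A "+8.30 percentage points in HitRate@5" gain thus means the system places at least one truly used dataset or baseline in its top five recommendations for noticeably more papers.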

📝 Abstract
Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have spurred a surge of research interest in developing LLM agents that facilitate scientific inquiry. One key application in AI research is automating experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity, which biases models toward superficial matches and overlooks experimental suitability. Harnessing the collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective-perception-enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we fine-tune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and fine-tunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
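The candidate-recall step described in the abstract, representing each dataset or baseline by its self-description concatenated with aggregated citation contexts, then matching by embedding similarity, can be sketched as follows. A toy bag-of-words embedding stands in for the paper's fine-tuned model, and the corpus entries are illustrative, not taken from the paper's data:

```python
import numpy as np
from collections import Counter

def embed(text, vocab):
    """Toy stand-in for a fine-tuned embedding model: normalized word counts."""
    counts = Counter(text.lower().split())
    v = np.array([counts[w] for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# Each entry: (self-description, aggregated citation contexts).
corpus = {
    "GLUE": ("a multi-task benchmark for language understanding",
             ["used to evaluate sentence encoders", "standard NLU benchmark"]),
    "CIFAR-10": ("small images in ten classes",
                 ["used for image classification baselines"]),
}

# Collective-perception representation: self-description + citation contexts.
docs = {name: desc + " " + " ".join(ctxs)
        for name, (desc, ctxs) in corpus.items()}

vocab = sorted({w for d in docs.values() for w in d.lower().split()})

def retrieve(query, k=1):
    """Rank candidates by cosine similarity between query and representation."""
    q = embed(query, vocab)
    scored = sorted(docs, key=lambda n: -float(embed(docs[n], vocab) @ q))
    return scored[:k]

print(retrieve("benchmark for evaluating language understanding models"))
```

Because the representation folds in how *other* papers describe the resource when citing it, a candidate can match a query even when its own self-description uses different wording, which is the motivation for the "collective perception" framing.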
Problem

Research questions and friction points this paper is trying to address.

Automating AI experiment design through dataset and baseline retrieval
Overcoming limited data coverage in existing recommendation systems
Addressing overreliance on superficial content similarity metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline linking papers to used datasets and baselines
Collective perception enhanced retriever using citation networks
Reasoning-augmented reranker with interpretable justifications
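The third contribution, a reranker that turns citation-network interaction chains into explicit reasoning for an LLM, can be sketched as prompt construction. The template, chain strings, and function names here are hypothetical illustrations, not the paper's actual implementation (the LLM call itself is omitted):

```python
# Hedged sketch: render extracted interaction chains into a reranking prompt.
# The LLM would return a refined ranking plus per-candidate justifications.

def build_rerank_prompt(query_abstract, candidates, chains):
    """Assemble a prompt exposing each candidate's citation-network chain."""
    lines = [
        "You recommend datasets and baselines for an AI experiment.",
        f"Paper abstract: {query_abstract}",
        "Candidates and their interaction chains:",
    ]
    for name in candidates:
        chain = " -> ".join(chains.get(name, ["(no chain found)"]))
        lines.append(f"- {name}: {chain}")
    lines.append("Rank the candidates and justify each placement.")
    return "\n".join(lines)

# Illustrative chain: how the candidate connects to the query's neighborhood.
chains = {"GLUE": ["cited by BERT as its evaluation suite",
                   "co-cited with MNLI in encoder papers"]}
prompt = build_rerank_prompt("We study sentence encoders.", ["GLUE"], chains)
print(prompt)
```

Feeding the chains explicitly, rather than raw similarity scores, is what lets the model ground its justification in how the community actually uses each resource.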
Yu Li
Tsinghua University, Beijing, China
Lehui Li
Shandong University, Jinan, China
Qingmin Liao
Tsinghua University, Beijing, China
Fengli Xu
Tsinghua University
LLM Agent, Data Science, Social Computing, Science of Science, Urban Science
Yong Li
Tsinghua University, Beijing, China