AI Scientist via Synthetic Task Scaling

📅 2026-03-17
🤖 AI Summary
Existing AI research agents often produce plausible-looking but ineffective machine learning solutions because they lack systematic training. To address this, the paper proposes a scalable synthetic task generation framework that automatically constructs high-quality, executable research tasks through topic sampling, proposal generation grounded in real Hugging Face datasets, and self-debugging validation. The framework further applies trajectory distillation, transferring effective research behaviors from a teacher model (GPT-5) to student models (Qwen3), to guide the students toward valid scientific reasoning paths. On the MLGym benchmark, Qwen3-4B and Qwen3-8B models trained with this approach achieve 9% and 12% relative improvements in the Area Under the Performance curve (AUP) metric, respectively, substantially outperforming baseline methods.
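The self-debugging validation the summary describes can be sketched as a simple generate-run-repair loop. This is a minimal illustration, not code from the paper: `generate_task_code` and `execute` are hypothetical stand-ins for the LLM call and the sandboxed task runner.

```python
def self_debug(generate_task_code, execute, max_rounds=3):
    """Generate a synthetic task's code, run it, and feed any error back
    to the generator until it executes cleanly or the budget runs out."""
    feedback = None
    for _ in range(max_rounds):
        code = generate_task_code(feedback)  # LLM proposes (or repairs) code
        ok, feedback = execute(code)         # sandboxed run; capture errors
        if ok:
            return code                      # task validated: keep it
    return None                              # never ran cleanly: discard task


# Illustrative usage with stubbed generator/runner:
attempts = iter(["buggy draft", "fixed draft"])

def gen(feedback):
    return next(attempts)

def run(code):
    ok = code == "fixed draft"
    return ok, None if ok else "Traceback: NameError"
```

Tasks that never pass the loop are dropped, which is how the pipeline filters out "plausible-looking but ineffective" generations before they reach training.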

📝 Abstract
With the advent of AI agents, automatic scientific discovery has become a tenable goal. Many recent works scaffold agentic systems that can perform machine learning research, but they do not offer a principled way to train such agents -- and current LLMs often generate plausible-looking but ineffective ideas. To make progress on training agents that can learn from doing, we provide a novel synthetic environment generation pipeline targeting machine learning agents. Our pipeline automatically synthesizes machine learning challenges compatible with the SWE-agent framework, covering topic sampling, dataset proposal, and code generation. The resulting synthetic tasks are (1) grounded in real machine learning datasets, since the proposed datasets are verified against the Hugging Face API, and (2) checked for quality with a self-debugging loop. To validate the effectiveness of our synthetic tasks, we tackle MLGym, a benchmark for machine learning tasks. From the synthetic tasks, we sample trajectories from a teacher model (GPT-5), then use the trajectories to train student models (Qwen3-4B and Qwen3-8B). The student models trained with our synthetic tasks achieve improved performance on MLGym, raising the AUP metric by 9% for Qwen3-4B and 12% for Qwen3-8B.
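Trajectory distillation as described in the abstract amounts to flattening teacher runs into supervised fine-tuning pairs for the student. The sketch below assumes a simple trajectory schema (`task_prompt`, `steps`, `score`); the field names and the score filter are illustrative assumptions, not the paper's actual data format.

```python
def trajectories_to_sft(trajectories, min_score=0.0):
    """Flatten agent trajectories into (prompt, completion) pairs,
    keeping only trajectories whose final task score clears a threshold."""
    examples = []
    for traj in trajectories:
        if traj["score"] < min_score:
            continue  # drop ineffective teacher runs
        context = traj["task_prompt"]
        for step in traj["steps"]:
            examples.append({"prompt": context, "completion": step["action"]})
            # the student sees the growing history, as the teacher did
            context += step["action"] + step["observation"]
    return examples
```

Each step becomes one training example whose prompt is the full interaction history so far, so the student learns to reproduce the teacher's action given the same context.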
Problem

Research questions and friction points this paper is trying to address.

AI Scientist
Synthetic Task Scaling
Machine Learning Agents
Automatic Scientific Discovery
Agent Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic task generation
AI scientist
machine learning agents
self-debugging loop
trajectory distillation