MLGym: A New Framework and Benchmark for Advancing AI Research Agents

📅 2025-02-20
🤖 AI Summary
Existing evaluations of large language model (LLM) agents lack a standardized, end-to-end assessment of their capabilities in autonomous AI research. Method: We introduce MLGym, the first Gym-style framework for AI research, accompanied by MLGym-Bench, a benchmark of 13 open-ended research tasks spanning multiple domains that require agents to perform hypothesis generation, model implementation, iterative experimentation, and result analysis. Contribution/Results: MLGym is the first standardized environment supporting RL-based training of AI research agents; it systematically defines and empirically evaluates LLMs' capabilities and limitations in autonomous scientific discovery. It supports integration of state-of-the-art frontier models (e.g., GPT-4o, Claude-3.5 Sonnet, Llama-3.1-405B) and extensible algorithmic plugins. Experiments reveal that even top-tier models succeed only at localized improvements, such as hyperparameter optimization, and fail to generate novel algorithms or architectures. The framework is open-sourced, establishing foundational infrastructure for evaluating and advancing AI-driven scientific autonomy.

📝 Abstract
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse, open-ended AI research tasks from domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmark, such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, and develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.
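To illustrate what "Gym environment" means in the abstract: a Gym-style environment exposes a `reset`/`step` interaction loop that an agent drives, which is what allows RL algorithms to be applied to research tasks. The sketch below is a hedged toy illustration, not MLGym's actual API; the environment class, the hidden-optimum task, and the sweep "agent" are all hypothetical stand-ins for an LLM agent tuning a hyperparameter.

```python
# Hypothetical sketch of a Gym-style interaction loop (classic 4-tuple step),
# NOT the real MLGym API. A toy "research task": the agent proposes a
# hyperparameter value and is rewarded for landing near a hidden optimum.

class ToyResearchTaskEnv:
    """Minimal Gym-style environment with reset() and step(action)."""

    def __init__(self, optimum=0.3, max_steps=5):
        self.optimum = optimum
        self.max_steps = max_steps

    def reset(self):
        # Return the initial observation describing the task.
        self.steps = 0
        self.best_reward = float("-inf")
        return {"task": "maximize validation score", "budget": self.max_steps}

    def step(self, action):
        # action: the agent's proposed hyperparameter value (a float).
        self.steps += 1
        reward = 1.0 - abs(action - self.optimum)  # closer to optimum => higher
        self.best_reward = max(self.best_reward, reward)
        done = self.steps >= self.max_steps
        return {"last_score": reward}, reward, done, {}


def run_agent(env):
    """A trivial grid-sweep 'agent', standing in for an LLM research agent."""
    env.reset()
    candidates = iter([0.1, 0.2, 0.3, 0.4, 0.5])
    done = False
    while not done:
        obs, reward, done, info = env.step(next(candidates))
    return env.best_reward


if __name__ == "__main__":
    print(run_agent(ToyResearchTaskEnv()))  # prints 1.0 (agent hits 0.3)
```

Because the environment is just `reset`/`step`/reward, any policy, from a scripted sweep to an RL-trained LLM agent, can be plugged into the same loop, which is the extensibility the framework claims.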
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents on AI research tasks
Developing reinforcement learning algorithms for ML tasks
Assessing frontier models' performance on diverse AI domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

First Gym-style environment for ML research tasks
13 diverse, open-ended AI research tasks
Open-source framework for training and evaluating LLM agents
👥 Authors
Deepak Nathani
Ph.D. Student, University of California - Santa Barbara
NLP · Controllable Text Generation · Conversational Agents
Lovish Madaan
AI at Meta & University College London
Machine Learning · Natural Language Processing
Nicholas Roberts
PhD candidate, UW-Madison
Machine Learning · AutoML · Data-Centric AI
Nikolay Bashlykov
GenAI at Meta
Ajay Menon
GenAI at Meta
Vincent Moens
Facebook
AI · ML · RL · DL · Normalizing Flows
Amar Budhiraja
GenAI at Meta
Despoina Magka
University of Oxford, Department of Computer Science
Artificial Intelligence · Knowledge Representation and Reasoning · Logic
Vladislav Vorotilov
GenAI at Meta
Gaurav Chaurasia
Meta
AI Agents · Computer Vision · Computer Graphics
Dieuwke Hupkes
Meta
Natural Language Processing · Computational Linguistics · Semantic Parsing · Artificial Neural Networks
Ricardo Silveira Cabral
Distinguished Research Scientist, NVIDIA
Language Processing · Computer Vision · Artificial Intelligence
Tatiana Shavrina
Meta
Natural Language Processing · Computational Linguistics · Benchmarking · Multilinguality
Jakob Foerster
Associate Professor, University of Oxford
Artificial Intelligence
Yoram Bachrach
Meta (FAIR)
Artificial Intelligence · Machine Learning · Multiagent Systems
William Yang Wang
Mellichamp Chair Professor, University of California, Santa Barbara
Natural Language Processing · Machine Learning · Artificial Intelligence · Language and Vision
Roberta Raileanu
Research Scientist at Google DeepMind, Honorary Lecturer at UCL
Artificial Intelligence · Reinforcement Learning · Deep Learning · Open-Ended Learning