MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

📅 2026-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether AI systems can autonomously invent generalizable and scalable machine learning methods rather than merely apply existing ones. It introduces MLS-Bench, a benchmark of 140 tasks across 12 domains in which an agent must improve one targeted component of an ML system or algorithm and show that the improvement generalizes across controlled settings and scales. The assessment finds that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. The study also examines test-time scaling, adaptive compute allocation, and context provision, and concludes that more search, compute, or context alone does not remove the bottleneck in the scientific insight needed to plan, validate, and scale claims about new methods.
📝 Abstract
Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents' discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at https://mls-bench.com.
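One way to read the requirement that an improvement "generalizes across controlled settings and scales" is as an acceptance rule over a grid of evaluations. The following is a minimal sketch under that reading, not MLS-Bench's actual scoring code: the `Result` record, the `generalizes` and `win_rate` functions, the higher-is-better convention, and the `min_gain` margin are all hypothetical names and assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Scores in one (setting, scale) cell; higher is better (an assumption)."""
    setting: str
    scale: str
    baseline: float   # score of the human-designed reference method
    candidate: float  # score of the agent-proposed method


def generalizes(results: list[Result], min_gain: float = 0.0) -> bool:
    """True only if the candidate beats the baseline by more than
    `min_gain` in every evaluated (setting, scale) cell."""
    if not results:
        return False  # no evidence, so no generalization claim
    return all(r.candidate - r.baseline > min_gain for r in results)


def win_rate(results: list[Result]) -> float:
    """Fraction of cells in which the candidate strictly beats the baseline."""
    if not results:
        return 0.0
    return sum(r.candidate > r.baseline for r in results) / len(results)


# Illustrative numbers only; they are not results from the paper.
tuned = [
    Result("setting_a", "small", 0.70, 0.74),
    Result("setting_a", "large", 0.80, 0.79),  # gain disappears at scale
    Result("setting_b", "small", 0.65, 0.68),
    Result("setting_b", "large", 0.75, 0.76),
]
print(generalizes(tuned))  # False
print(win_rate(tuned))     # 0.75
```

The strict all-cells rule is a deliberate choice for the sketch: a method that wins in most cells but loses at a larger scale looks like tuning rather than a generalizable method, which is the distinction the abstract draws.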
Problem

Research questions and friction points this paper is trying to address.

AI invention
generalizable ML methods
scalable ML
method discovery
scientific reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLS-Bench
method invention
generalizability
scalability
AI-driven science
🔎 Similar Papers
Bohan Lyu
Undergraduate, Tsinghua University
Machine Learning
Yucheng Yang
University of Zurich and Swiss Finance Institute
Macroeconomics, Finance, Machine Learning, Computational Economics, Monetary Economics
Siqiao Huang
Institute for Interdisciplinary Information Sciences (Yao Class), Tsinghua University
Machine Learning, Robotics
Jiaru Zhang
Purdue University
Qixin Xu
Undergraduate of Computer Science, Tsinghua University
Multi-Modal Learning, Reinforcement Learning
Xinghan Li
ZJU
Robotics, state estimation, embodied AI
Xinyang Han
Southern University of Science and Technology
Robot control, Embedded system
Yicheng Zhang
Tsinghua University
Huaqing Zhang
Tsinghua University
Runhan Huang
Undergraduate, Tsinghua University
Reinforcement Learning, Robotics, Generative AI
Kaicheng Yang
DeepGlint
Multimodal, CV, NLP
Zitao Chen
Tsinghua University
Wentao Guo
CS PhD student at Princeton University
Machine Learning
Junlin Yang
Department of Computer Science and Technology, Tsinghua University
Natural Language Processing, Machine Learning
Xinyue Ai
University of Pennsylvania
Wenhao Chai
Princeton University
Machine Learning, Computer Vision
Yadi Cao
University of California San Diego
Scientific Machine Learning, Numerical PDEs, Computational Mechanics, Fluid Dynamics
Ziran Yang
Princeton University
Large Language Model, Reinforcement Learning
Kun Wang
Princeton University
Dapeng Jiang
Tsinghua University
Huan-ang Gao
Ph.D. student, Tsinghua University
Agent, Vision & Robotics
Shange Tang
Princeton University
Machine learning, Statistics
Chengshuai Shi
Princeton Language and Intelligence
Reinforcement Learning, Intelligent Decision-Making
Simon S. Du
University of Washington
Max Simchowitz
MIT
Machine Learning Theory, Robotics, Control