MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether AI systems can invent generalizable and scalable machine learning methods rather than only apply existing ones. The authors introduce MLS-Bench, a benchmark of 140 tasks across 12 domains, in which an agent must improve one targeted component of an ML system or algorithm and show that the improvement generalizes across controlled settings and scales. The assessment finds that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. Further experiments on test-time scaling, adaptive compute allocation, and context provision suggest that more search, compute, or context alone does not remove the bottleneck, which lies not only in proposing new methods but also in the scientific insight needed to plan, validate, and scale claims about them.

📝 Abstract

Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents' discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at https://mls-bench.com.

Problem

Research questions and friction points this paper is trying to address.

AI invention

generalizable ML methods

scalable ML

method discovery

scientific reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

MLS-Bench

method invention

generalizability