🤖 AI Summary
This study evaluates how well large language models (LLMs) can implement *unseen, cutting-edge machine learning ideas*: specifically, whether they can generate correct, executable code for novel methods introduced in top 2024–2025 research papers, which were absent from their pretraining data.
Method: We introduce ResearchCodeBench, the first community-driven benchmark targeting *novel research ideas*, comprising 212 rigorously human-verified coding tasks spanning diverse ML domains. It is used to evaluate 30+ open- and closed-weight LLMs, and combines paper-grounded task construction, multi-dimensional root-cause analysis of failures, and systematic assessment of data contamination.
Contribution/Results: State-of-the-art models exhibit severe limitations: Gemini-2.5-Pro-Preview, the best performer, achieves only a 37.3% success rate, and most models fall below 30%. These results reveal a critical generalization bottleneck in LLMs' capacity for research-level code generation and establish ResearchCodeBench as a diagnostic benchmark and evaluation framework for scientific AI assistance.
📝 Abstract
Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement novel ideas from recent research papers (ideas unseen during pretraining) remains unclear. We introduce ResearchCodeBench, a benchmark of 212 coding challenges that evaluates LLMs' ability to translate cutting-edge ML contributions from top 2024–2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs and found that even the best models correctly implement less than 40% of the code. Gemini-2.5-Pro-Preview performs best with a 37.3% success rate, followed by O3 (High) and O4-mini (High) at 32.3% and 30.8%, respectively. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous and community-driven evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.