ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates large language models’ (LLMs) ability to implement *unseen, cutting-edge machine learning ideas*: specifically, whether they can generate correct, executable code for novel methods introduced in 2024–2025 top-tier conference papers that were absent from their pretraining data. Method: We introduce ResearchCodeBench, the first community-driven benchmark targeting *novel research ideas*, comprising 212 rigorously human-verified tasks spanning diverse ML domains. It supports evaluation across 30+ open- and closed-weight LLMs, using paper-grounded task construction, multi-dimensional failure root-cause analysis, and systematic data contamination assessment. Contribution/Results: State-of-the-art models exhibit severe limitations: Gemini-2.5-Pro-Preview achieves a success rate of only 37.3%, and most models fall below 30%. These results reveal a critical generalization bottleneck in LLMs’ capacity for original research-level code generation, establishing ResearchCodeBench as a diagnostic benchmark and evaluation framework for scientific AI assistance.

📝 Abstract

Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement novel ideas from recent research papers (ideas unseen during pretraining) remains unclear. We introduce ResearchCodeBench, a benchmark of 212 coding challenges that evaluates LLMs' ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs, finding that even the best models correctly implement less than 40% of the code. Gemini-2.5-Pro-Preview performs best with a 37.3% success rate, followed by O3 (High) and O4-mini (High) at 32.3% and 30.8%, respectively. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous and community-driven evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.
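
To make the reported metric concrete, the sketch below shows one plausible way a per-model success rate could be computed from pass/fail outcomes on coding challenges. It assumes every task counts equally, which the abstract does not state; the benchmark's actual scoring may weight tasks differently. The function names `success_rate` and `rank_models` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def success_rate(results: List[bool]) -> float:
    """Return the percentage of tasks whose generated code passed its tests.

    Assumption: every task has equal weight (not confirmed by the paper).
    """
    if not results:
        raise ValueError("no task results to score")
    return 100.0 * sum(results) / len(results)


def rank_models(per_model: Dict[str, List[bool]]) -> List[Tuple[str, float]]:
    """Sort models from highest to lowest success rate."""
    scored = [(name, success_rate(outcomes)) for name, outcomes in per_model.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pass/fail outcomes on four tasks; not data from the paper.
toy_results = {
    "model_a": [True, False, False, True],
    "model_b": [True, False, False, False],
}
ranking = rank_models(toy_results)
```

Under this equal-weight assumption, a model that passes 79 of 212 tasks would score about 37.3%, which is only an illustration of the arithmetic and not a claim about how the reported figure was obtained.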

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to implement novel ML research code

Assessing code generation from cutting-edge 2024-2025 research papers

Quantifying the gap: across 30+ LLMs, even the best models correctly implement less than 40% of the code

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLMs on novel ML research code

Evaluating 30+ models on 212 coding challenges

Assessing code generation from 2024-2025 papers
