LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the scientific reasoning and engineering implementation capabilities of large language model (LLM) agents in reproducing code from top-tier NLP conference papers. Addressing the lack of systematic evaluation, we introduce LMR-BENCH, a dedicated benchmark comprising 28 reproduction tasks derived from 23 papers published in top-tier NLP venues over the past five years. The benchmark quantifies LLMs' code synthesis ability under three core challenges: cross-document understanding, multi-file dependency reasoning, and scientific intent decoding. We propose a dual-track evaluation paradigm combining unit-test pass rate with an LLM-assisted correctness score. Experiments across GPT-4, Claude, and Llama models, deployed with ReAct and Plan-and-Execute frameworks, reveal that state-of-the-art models achieve an average pass rate below 35%, substantially underperforming human reproducers. This gap highlights fundamental limitations in grounding abstract descriptions in concrete code and in coordinating execution across complex logical dependencies.
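The dual-track evaluation described above can be sketched in a few lines of Python. This is a minimal illustration only; the function names, the equal weighting, and the aggregation formula are assumptions for exposition, not the benchmark's actual implementation:

```python
def unit_test_pass_rate(results):
    """Fraction of unit tests that pass (results is a list of booleans)."""
    return sum(results) / len(results) if results else 0.0

def combined_score(test_results, llm_score, weight=0.5):
    """Combine the unit-test pass rate with an LLM-assigned correctness
    score in [0, 1]. The 50/50 weighting here is an illustrative
    assumption, not the weighting used by LMR-BENCH."""
    return weight * unit_test_pass_rate(test_results) + (1 - weight) * llm_score

# Example: 3 of 4 unit tests pass; an LLM judge rates correctness 0.6
score = combined_score([True, True, True, False], 0.6)  # → 0.675
```

In practice the two tracks complement each other: unit tests catch functional failures deterministically, while the LLM-based score can credit partially correct implementations that tests alone would mark as failures.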

📝 Abstract
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents' ability to autonomously reproduce scientific research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents' code reproduction from NLP papers
Assessing complex reasoning in abstract concept synthesis
Testing code correctness in interdependent repository contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for LLM agent code reproduction evaluation
Masked functions in code repository for testing
LLM-based evaluation of code correctness
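The masked-function setup can be illustrated with a small sketch: a function body in the repository is replaced with a placeholder, leaving the signature intact for the agent to re-implement. The regex, placeholder text, and example function here are assumptions for illustration, not the benchmark's actual masking code (which would need to handle nested and multi-file definitions):

```python
import re

def mask_function(source: str, func_name: str) -> str:
    """Replace the body of a simple top-level `func_name` with a
    placeholder, keeping the signature so an agent can re-implement it.
    Naive regex: handles single, non-nested defs only (illustrative)."""
    pattern = rf"(def {func_name}\([^)]*\):\n)((?:[ \t]+.*\n?)+)"
    return re.sub(pattern, r"\1    raise NotImplementedError  # TODO: implement\n", source)

repo_file = (
    "def scaled_dot_product(q, k):\n"
    "    return q @ k.T\n"
)
masked = mask_function(repo_file, "scaled_dot_product")
# The signature survives; the body is replaced with the placeholder.
```

The agent then receives the paper, the repository with such masked functions, and instructions, and is scored on whether its reconstructed implementations pass the hidden unit tests.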