LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the scientific reasoning and engineering implementation capabilities of large language model (LLM) agents in reproducing code from top-tier NLP conference papers. Addressing the lack of systematic evaluation, we introduce LMR-BENCH, a dedicated benchmark comprising 28 reproduction tasks derived from 23 papers published in top-tier NLP venues over the past five years. The benchmark quantifies LLMs' code synthesis ability under three core challenges: cross-document understanding, multi-file dependency reasoning, and scientific intent decoding. We propose a dual-track evaluation paradigm combining unit-test pass rate with an LLM-assisted correctness score. Experiments across GPT-4, Claude, and Llama models, deployed with ReAct and Plan-and-Execute frameworks, reveal that state-of-the-art models achieve an average pass rate below 35%, substantially underperforming human reproducers. This gap highlights fundamental limitations in grounding abstract descriptions in concrete code and in coordinating execution across complex logical dependencies.
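The dual-track evaluation described above can be sketched in a few lines of Python. This is a minimal illustration only; the function names, the equal weighting, and the aggregation formula are assumptions for exposition, not the benchmark's actual implementation:

```python
def unit_test_pass_rate(results):
    """Fraction of unit tests that pass (results is a list of booleans)."""
    return sum(results) / len(results) if results else 0.0

def combined_score(test_results, llm_score, weight=0.5):
    """Combine the unit-test pass rate with an LLM-assigned correctness
    score in [0, 1]. The 50/50 weighting here is an illustrative
    assumption, not the weighting used by LMR-BENCH."""
    return weight * unit_test_pass_rate(test_results) + (1 - weight) * llm_score

# Example: 3 of 4 unit tests pass; an LLM judge rates correctness 0.6
score = combined_score([True, True, True, False], 0.6)  # → 0.675
```

In practice the two tracks complement each other: unit tests catch functional failures deterministically, while the LLM-based score can credit partially correct implementations that tests alone would mark as failures.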

📝 Abstract
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents' ability to autonomously reproduce scientific research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents' code reproduction from NLP papers
Assessing complex reasoning in abstract concept synthesis
Testing code correctness in interdependent repository contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for LLM agent code reproduction evaluation
Masked functions in code repository for testing
LLM-based evaluation of code correctness
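The masked-function setup can be illustrated with a small sketch: a function body in the repository is replaced with a placeholder, leaving the signature intact for the agent to re-implement. The regex, placeholder text, and example function here are assumptions for illustration, not the benchmark's actual masking code (which would need to handle nested and multi-file definitions):

```python
import re

def mask_function(source: str, func_name: str) -> str:
    """Replace the body of a simple top-level `func_name` with a
    placeholder, keeping the signature so an agent can re-implement it.
    Naive regex: handles single, non-nested defs only (illustrative)."""
    pattern = rf"(def {func_name}\([^)]*\):\n)((?:[ \t]+.*\n?)+)"
    return re.sub(pattern, r"\1    raise NotImplementedError  # TODO: implement\n", source)

repo_file = (
    "def scaled_dot_product(q, k):\n"
    "    return q @ k.T\n"
)
masked = mask_function(repo_file, "scaled_dot_product")
# The signature survives; the body is replaced with the placeholder.
```

The agent then receives the paper, the repository with such masked functions, and instructions, and is scored on whether its reconstructed implementations pass the hidden unit tests.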