🤖 AI Summary
Existing LLM evaluations lack benchmarks for automating the deployment of complex scientific codebases, particularly in computer science research. Method: The authors introduce CSR-Bench, a benchmark for deploying computer science research repositories spanning NLP, CV, AI, ML, and data mining, together with CSR-Agents, a multi-agent framework that parses Markdown instructions and repository structure to generate, execute, and iteratively refine the bash commands needed for end-to-end experimental environment setup. The approach combines collaboration among multiple LLM agents, structured repository parsing, instruction-driven script generation, and an execution feedback loop, and is evaluated on real-world open-source GitHub projects. Contribution/Results: Preliminary experiments show notable gains in deployment success rate and developer productivity, suggesting that LLM agents can orchestrate complex, multi-step scientific codebase deployments with reduced human intervention.
📝 Abstract
The increasing complexity of computer science research projects demands more effective tools for deploying code repositories. Large Language Models (LLMs), such as Anthropic Claude and Meta Llama, have demonstrated significant advancements across various fields of computer science research, including the automation of diverse software engineering tasks. To evaluate the effectiveness of LLMs in handling complex code development tasks in research projects, particularly on NLP/CV/AI/ML/DM topics, we introduce CSR-Bench, a benchmark for Computer Science Research projects. This benchmark assesses LLMs along several dimensions, including accuracy, efficiency, and deployment script quality, aiming to explore their potential for conducting computer science research autonomously. We also introduce a novel framework, CSR-Agents, that utilizes multiple LLM agents to automate the deployment of GitHub code repositories for computer science research projects. Specifically, by reading instructions from Markdown files and interpreting repository structures, the framework generates and iteratively improves bash commands that set up the experimental environment and deploy the code to conduct research tasks. Preliminary results on CSR-Bench indicate that LLM agents can significantly enhance the repository deployment workflow, thereby boosting developer productivity and improving the management of development workflows.
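To make the iterative command-refinement loop concrete, here is a minimal Python sketch of the pattern the abstract describes. Everything here is an illustrative assumption rather than the authors' actual API: the `llm` callable, the prompts, the retry budget, and the function names are hypothetical, and the sketch shows only the core generate → execute → feed-errors-back cycle on a cloned repository.

```python
import subprocess

MAX_RETRIES = 3  # hypothetical retry budget; the paper's actual value is not given


def run_command(cmd: str, cwd: str) -> tuple[bool, str]:
    """Execute a shell command and capture its combined output."""
    result = subprocess.run(
        cmd, shell=True, cwd=cwd,
        capture_output=True, text=True, timeout=600,
    )
    return result.returncode == 0, result.stdout + result.stderr


def deploy_repo(llm, repo_path: str, readme: str) -> bool:
    """Generate setup commands from the README, then refine them on failure.

    `llm` is an assumed callable mapping a prompt string to a
    newline-separated list of bash commands.
    """
    prompt = f"Write bash commands to set up this repository:\n{readme}"
    commands = llm(prompt).splitlines()
    for cmd in commands:
        for _attempt in range(MAX_RETRIES):
            ok, log = run_command(cmd, cwd=repo_path)
            if ok:
                break
            # Feed the error log back so the model can propose a fix.
            cmd = llm(
                f"The command `{cmd}` failed with:\n{log}\n"
                "Propose a corrected bash command."
            ).strip()
        else:
            return False  # exhausted retries on this deployment step
    return True
```

In the actual CSR-Agents framework, multiple specialized agents presumably divide this work (e.g., drafting commands versus analyzing failures); the single `llm` callable above collapses those roles for brevity.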