🤖 AI Summary
To address the challenges of costly full-repository builds and inefficient execution feedback in repository-level code generation, this paper proposes a sandbox-testing-driven paradigm for constructing lightweight execution environments. The method isolates the target function together with its minimal dependency set and executes it dynamically in an isolated sandbox to obtain precise, fine-grained feedback, bypassing the scalability bottleneck of full-repository compilation. Key components include dependency-aware minimal extraction, automated test script generation, and the construction of large-scale function-level datasets (RepoST-Train, with 7,415 functions, and RepoST-Eval). Experiments show substantial gains in code model performance: training with this execution feedback increases Pass@1 by 5.5% on HumanEval and 3.5% on RepoEval. The authors further benchmark 12 mainstream models on RepoST-Eval. The proposed infrastructure enables highly scalable, low-coupling execution feedback for repository-level code generation.
📝 Abstract
We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation, for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback via sandbox testing, which isolates a given target function and its dependencies into a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale training set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
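To make the sandbox-testing idea concrete, below is a minimal sketch of isolating a target function and its dependency set into a standalone test script. This is an illustrative toy, not the paper's implementation: it assumes a single-module "repository" and resolves only same-module function dependencies via `ast`, whereas RepoST handles real multi-file repositories and generates tests automatically. All names (`MODULE_SRC`, `extract_sandbox`, `clamp`, `scale`) are hypothetical.

```python
import ast

# Toy single-file "repository": `scale` is the target; it calls `clamp`.
MODULE_SRC = '''
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def scale(x, factor):
    return clamp(x * factor, 0, 100)
'''

def extract_sandbox(module_src: str, target: str) -> str:
    """Collect the target function plus the module-level functions it
    (transitively) references, and assemble a standalone test script."""
    tree = ast.parse(module_src)
    defs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in needed or name not in defs:
            continue
        needed.add(name)
        # Any name referenced inside this function that is also a
        # module-level definition is part of the dependency set.
        for node in ast.walk(defs[name]):
            if isinstance(node, ast.Name) and node.id in defs:
                stack.append(node.id)
    # Emit the minimal dependency set in original definition order,
    # followed by a sandbox-level check (hand-written here; RepoST
    # generates such tests automatically).
    parts = [ast.unparse(defs[n.name]) for n in tree.body
             if isinstance(n, ast.FunctionDef) and n.name in needed]
    parts.append(f"assert {target}(30, 2) == 60")
    return "\n\n".join(parts)

script = extract_sandbox(MODULE_SRC, "scale")
exec(script)  # executing the isolated script yields pass/fail feedback
```

Because the extracted script carries only the target's minimal dependency set, it can be executed in isolation without building or installing the full repository, which is what makes environment construction scale.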