🤖 AI Summary
To address the challenges of costly full-repository builds and inefficient execution feedback in repository-level code generation, this paper proposes a sandbox-testing-driven paradigm for constructing lightweight execution environments. The method isolates the target function together with its minimal dependency set and executes it dynamically in an isolated sandbox to obtain precise, fine-grained feedback, bypassing the scalability bottleneck of full-repository compilation. Key components include dependency-aware minimal extraction, automated test script generation, and the construction of large-scale function-level datasets (RepoST-Train, with 7,415 functions, and RepoST-Eval). Experiments show substantial gains in code model performance: training with this execution feedback increases Pass@1 by 5.5% on HumanEval and 3.5% on RepoEval. The authors further benchmark 12 mainstream models on RepoST-Eval. The proposed infrastructure enables highly scalable, low-coupling execution feedback for repository-level code generation.
📝 Abstract
We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation, for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback via sandbox testing, which isolates a given target function and its dependencies into a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale training set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
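To make the sandbox-testing idea concrete, below is a minimal sketch of isolating a target function and its dependency set into a standalone test script. This is an illustrative toy, not the paper's implementation: it assumes a single-module "repository" and resolves only same-module function dependencies via `ast`, whereas RepoST handles real multi-file repositories and generates tests automatically. All names (`MODULE_SRC`, `extract_sandbox`, `clamp`, `scale`) are hypothetical.

```python
import ast

# Toy single-file "repository": `scale` is the target; it calls `clamp`.
MODULE_SRC = '''
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def scale(x, factor):
    return clamp(x * factor, 0, 100)
'''

def extract_sandbox(module_src: str, target: str) -> str:
    """Collect the target function plus the module-level functions it
    (transitively) references, and assemble a standalone test script."""
    tree = ast.parse(module_src)
    defs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in needed or name not in defs:
            continue
        needed.add(name)
        # Any name referenced inside this function that is also a
        # module-level definition is part of the dependency set.
        for node in ast.walk(defs[name]):
            if isinstance(node, ast.Name) and node.id in defs:
                stack.append(node.id)
    # Emit the minimal dependency set in original definition order,
    # followed by a sandbox-level check (hand-written here; RepoST
    # generates such tests automatically).
    parts = [ast.unparse(defs[n.name]) for n in tree.body
             if isinstance(n, ast.FunctionDef) and n.name in needed]
    parts.append(f"assert {target}(30, 2) == 60")
    return "\n\n".join(parts)

script = extract_sandbox(MODULE_SRC, "scale")
exec(script)  # executing the isolated script yields pass/fail feedback
```

Because the extracted script carries only the target's minimal dependency set, it can be executed in isolation without building or installing the full repository, which is what makes environment construction scale.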