EnvBench: A Benchmark for Automated Environment Setup

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research on automated environment configuration relies on small-scale datasets that fail to reflect the range of configuration challenges encountered in practice. Method: We introduce EnvBench, a large-scale environment setup benchmark comprising 329 Python and 665 JVM-based (Java, Kotlin) open-source repositories, selected to exclude projects that simple deterministic scripts can fully configure. To support extending the benchmark and using it for model tuning, it provides two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. Results: Evaluated with three approaches (a zero-shot baseline and two agentic workflows) backed by GPT-4o and GPT-4o-mini, the best method successfully configures only 6.69% of Python repositories and 29.47% of JVM repositories, underscoring the limitations of current LLMs on practical environment setup tasks.

📝 Abstract
Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in the software engineering domain. In this work, we consider a cornerstone task for automating work with software repositories: environment setup, i.e., the task of configuring a repository-specific development environment on a system. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark. It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts. To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, which we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini. The best approach manages to successfully configure 6.69% of repositories for Python and 29.47% of repositories for JVM, suggesting that EnvBench remains challenging for current approaches. Our benchmark suite is publicly available at https://github.com/JetBrains-Research/EnvBench. The dataset and experiment trajectories are available at https://jb.gg/envbench.
Problem

Research questions and friction points this paper is trying to address.

Automating environment setup for software repositories
Evaluating environment setup strategies on a comprehensive, realistic benchmark
Addressing gaps left by evaluations based on small datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces EnvBench, a large-scale benchmark for automated environment setup
Includes 329 Python and 665 JVM-based (Java, Kotlin) repositories with genuine configuration challenges
Implements automatic static analysis (missing imports, Python) and compilation (JVM) checks; see the sketch below
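
For intuition, the sketch below shows what these two checks could look like in practice: an AST-based scan for imports that the configured Python environment cannot resolve, and a build-tool invocation that treats a clean compile as success. This is a minimal illustration under our own assumptions, not EnvBench's actual implementation; the function names, the find_spec-based resolution heuristic, and the gradlew/mvn commands are all illustrative.

```python
"""Minimal sketch of the two automatic checks described for EnvBench.

Illustrative only: function names, the import-resolution heuristic, and the
build commands are assumptions, not the benchmark's actual implementation.
"""
import ast
import importlib.util
import subprocess
from pathlib import Path


def missing_python_imports(repo_root: str) -> list[str]:
    """Return top-level modules imported in the repo that the current
    (supposedly configured) environment cannot resolve."""
    root = Path(repo_root)
    # Crude first-party filter: top-level modules/packages of the repo itself.
    local = {p.stem for p in root.glob("*.py")} | {p.name for p in root.iterdir() if p.is_dir()}
    missing: set[str] = set()
    for py_file in root.rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text(encoding="utf-8", errors="ignore"))
        except SyntaxError:
            continue  # unparsable files are ignored in this sketch
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                names = [node.module]  # absolute `from x import y`
            else:
                continue
            for name in names:
                top = name.split(".")[0]
                if top not in local and importlib.util.find_spec(top) is None:
                    missing.add(top)
    return sorted(missing)


def jvm_project_compiles(repo_root: str) -> bool:
    """Treat the repository as configured if its build tool compiles it cleanly."""
    root = Path(repo_root)
    if (root / "gradlew").exists():
        cmd = [str(root / "gradlew"), "compileJava", "--quiet"]  # Kotlin projects would use compileKotlin
    elif (root / "pom.xml").exists():
        cmd = ["mvn", "--quiet", "compile"]
    else:
        return False  # no recognized build tool
    try:
        result = subprocess.run(cmd, cwd=root, capture_output=True, timeout=1800)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Unresolved imports:", missing_python_imports("."))
```

In the benchmark itself, such checks serve as the pass/fail criterion for a setup attempt; the actual implementation lives in the EnvBench repository linked above.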