🤖 AI Summary
Existing evaluations lack a rigorous assessment of large language models' (LLMs') capability to reason about code dependencies, a critical gap for reliable software synthesis.
Method: We introduce DI-BENCH, the first dependency inference benchmark targeting large-scale, testable open-source repositories—581 projects across Python, C#, Rust, and JavaScript, each with fully configured CI environments. DI-BENCH employs an end-to-end execution pass rate as its primary metric, integrating dual-track evaluation: (i) text-based similarity analysis and (ii) ground-truth feedback from compilation, execution, and test validation. It further features a cross-language, reproducible, and automated dependency resolution pipeline.
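The text-based track of the dual-track evaluation can be illustrated with a minimal sketch: scoring a model's predicted dependency list against the ground-truth manifest with set-level precision, recall, and F1. The function name and the choice of F1 as the similarity measure are assumptions for illustration, not DI-BENCH's actual implementation.

```python
def dependency_f1(predicted, ground_truth):
    """Precision, recall, and F1 over sets of dependency names.

    Hypothetical helper (not from DI-BENCH): compares a model's
    predicted dependencies against the ground-truth manifest.
    """
    pred, truth = set(predicted), set(ground_truth)
    tp = len(pred & truth)  # correctly inferred dependencies
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Example: the model recovers two of three required packages
# and hallucinates one spurious package.
p, r, f1 = dependency_f1(
    ["requests", "numpy", "flask"],      # predicted
    ["requests", "numpy", "pyyaml"],     # ground truth
)
```

A text score like this is cheap but insufficient on its own, which is why the execution track (compilation, execution, and test validation in the configured CI environment) serves as the primary, ground-truth signal.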
Contribution/Results: Experiments reveal that state-of-the-art LLMs achieve only 42.9% execution pass rate on DI-BENCH, exposing fundamental weaknesses in dependency identification and completion. DI-BENCH thus establishes the first quantitative, diagnosable, and scalable benchmark for evaluating LLMs’ dependency reasoning in software synthesis.
📝 Abstract
Large Language Models have advanced automated software development; however, correctly inferring dependencies, namely identifying the internal components and external packages required for a repository to run successfully, remains a challenge. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability for dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.