DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations lack a rigorous assessment of large language models' (LLMs) ability to reason about code dependencies, a critical gap for reliable software synthesis. Method: We introduce DI-BENCH, the first dependency inference benchmark built on large-scale, testable open-source repositories: 581 projects across Python, C#, Rust, and JavaScript, each with a fully configured CI environment. DI-BENCH uses end-to-end execution pass rate as its primary metric and integrates a dual-track evaluation: (i) text-based similarity analysis and (ii) ground-truth feedback from compilation, execution, and test validation. It further provides a cross-language, reproducible, and automated dependency resolution pipeline. Contribution/Results: Experiments show that the best-performing state-of-the-art LLM achieves only a 42.9% execution pass rate on DI-BENCH, exposing fundamental weaknesses in dependency identification and completion. DI-BENCH thus establishes the first quantitative, diagnosable, and scalable benchmark for evaluating LLMs' dependency reasoning in software synthesis.
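
To make the dual-track evaluation concrete, the sketch below is a minimal illustration, not the DI-BENCH implementation: names such as `ExecutionResult`, `textual_f1`, and `execution_pass_rate` are hypothetical. It assumes the text-based track scores overlap between predicted and ground-truth package lists, and the execution track simply records whether a repository's CI tests pass after the predicted dependencies are installed.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Hypothetical record of one repository's CI run after installing the
    model-predicted dependencies (not the actual DI-BENCH data structure)."""
    built: bool          # did the project compile / install?
    tests_passed: bool   # did the CI test suite pass end to end?


def textual_f1(predicted: set[str], ground_truth: set[str]) -> float:
    """Text-based track: F1 overlap between predicted and true package names."""
    if not predicted or not ground_truth:
        return 0.0
    tp = len(predicted & ground_truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


def execution_pass_rate(results: list[ExecutionResult]) -> float:
    """Execution track: fraction of repositories whose tests pass end to end."""
    if not results:
        return 0.0
    return sum(r.built and r.tests_passed for r in results) / len(results)


# Toy usage: one repo passes its tests, the other fails to build.
runs = [ExecutionResult(built=True, tests_passed=True),
        ExecutionResult(built=False, tests_passed=False)]
print(execution_pass_rate(runs))                        # 0.5
print(textual_f1({"requests", "numpy"}, {"requests"}))  # ~0.667
```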

📝 Abstract
Large Language Models have advanced automated software development; however, it remains challenging to correctly infer dependencies, namely, to identify the internal components and external packages required for a repository to run successfully. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability on dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
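
As an intuition for the task itself, the following sketch shows a naive, non-LLM baseline for Python dependency inference: scan a repository for top-level imports and keep the modules that are neither standard-library nor defined locally. It is illustrative only (the paper evaluates LLMs, not this heuristic), and the assumption that a module name equals its package name is a deliberate simplification.

```python
import ast
import sys
from pathlib import Path


def infer_dependencies(repo_root: str) -> set[str]:
    """Naive baseline: top-level imported modules that are neither part of the
    standard library nor defined inside the repository itself."""
    root = Path(repo_root)
    local_modules = {p.stem for p in root.rglob("*.py")}  # crude local-module set
    imported: set[str] = set()
    for path in root.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse cleanly
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                imported.add(node.module.split(".")[0])
    stdlib = getattr(sys, "stdlib_module_names", frozenset())  # Python 3.10+
    return {m for m in imported if m not in stdlib and m not in local_modules}


# Example: list candidate third-party packages for a checked-out repository.
# print(infer_dependencies("path/to/repo"))
```
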
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Code Dependency Resolution
Software Development Automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DI-BENCH
Large Language Models
Dependency Detection
Authors

Linghao Zhang
Microsoft

Junhao Wang
Tongji University

Shilin He
Microsoft Research
LLM, Software Engineering, NLP

Chaoyun Zhang
Microsoft
GUI Agent, LLM, Causal Inference, AIOps, Spatio-temporal Modelling

Yu Kang
Microsoft

Bowen Li
Shanghai AI Laboratory

Jiaheng Wen
Zhejiang University

Chengxing Xie
Shanghai AI Laboratory

Maoquan Wang
Microsoft

Yufan Huang
Microsoft

Elsie Nallipogu
Microsoft

Qingwei Lin
Microsoft

Yingnong Dang
Microsoft
Cloud service, data analytics, software analytics, machine learning, human-computer interaction

Saravan Rajmohan
Microsoft

Dongmei Zhang
Microsoft Research
Software Engineering, Machine Learning, Information Visualization

Qi Zhang
Microsoft