DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations lack a rigorous assessment of large language models' (LLMs) ability to reason about code dependencies, a critical gap for reliable software synthesis. Method: We introduce DI-BENCH, the first dependency inference benchmark built on large-scale, testable open-source repositories: 581 projects across Python, C#, Rust, and JavaScript, each with a fully configured CI environment. DI-BENCH uses end-to-end execution pass rate as its primary metric and integrates a dual-track evaluation: (i) text-based similarity analysis and (ii) ground-truth feedback from compilation, execution, and test validation. It further provides a cross-language, reproducible, and automated dependency resolution pipeline. Contribution/Results: Experiments show that the best-performing state-of-the-art LLM achieves only a 42.9% execution pass rate on DI-BENCH, exposing fundamental weaknesses in dependency identification and completion. DI-BENCH thus establishes the first quantitative, diagnosable, and scalable benchmark for evaluating LLMs' dependency reasoning in software synthesis.
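
To make the dual-track evaluation concrete, the sketch below is a minimal illustration, not the DI-BENCH implementation: names such as `ExecutionResult`, `textual_f1`, and `execution_pass_rate` are hypothetical. It assumes the text-based track scores overlap between predicted and ground-truth package lists, and the execution track simply records whether a repository's CI tests pass after the predicted dependencies are installed.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Hypothetical record of one repository's CI run after installing the
    model-predicted dependencies (not the actual DI-BENCH data structure)."""
    built: bool          # did the project compile / install?
    tests_passed: bool   # did the CI test suite pass end to end?


def textual_f1(predicted: set[str], ground_truth: set[str]) -> float:
    """Text-based track: F1 overlap between predicted and true package names."""
    if not predicted or not ground_truth:
        return 0.0
    tp = len(predicted & ground_truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


def execution_pass_rate(results: list[ExecutionResult]) -> float:
    """Execution track: fraction of repositories whose tests pass end to end."""
    if not results:
        return 0.0
    return sum(r.built and r.tests_passed for r in results) / len(results)


# Toy usage: one repo passes its tests, the other fails to build.
runs = [ExecutionResult(built=True, tests_passed=True),
        ExecutionResult(built=False, tests_passed=False)]
print(execution_pass_rate(runs))                        # 0.5
print(textual_f1({"requests", "numpy"}, {"requests"}))  # ~0.667
```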

📝 Abstract
Large Language Models have advanced automated software development; however, it remains challenging to correctly infer dependencies, namely, to identify the internal components and external packages required for a repository to run successfully. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability on dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
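
As an intuition for the task itself, the following sketch shows a naive, non-LLM baseline for Python dependency inference: scan a repository for top-level imports and keep the modules that are neither standard-library nor defined locally. It is illustrative only (the paper evaluates LLMs, not this heuristic), and the assumption that a module name equals its package name is a deliberate simplification.

```python
import ast
import sys
from pathlib import Path


def infer_dependencies(repo_root: str) -> set[str]:
    """Naive baseline: top-level imported modules that are neither part of the
    standard library nor defined inside the repository itself."""
    root = Path(repo_root)
    local_modules = {p.stem for p in root.rglob("*.py")}  # crude local-module set
    imported: set[str] = set()
    for path in root.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse cleanly
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                imported.add(node.module.split(".")[0])
    stdlib = getattr(sys, "stdlib_module_names", frozenset())  # Python 3.10+
    return {m for m in imported if m not in stdlib and m not in local_modules}


# Example: list candidate third-party packages for a checked-out repository.
# print(infer_dependencies("path/to/repo"))
```
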
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Code Dependency Resolution
Software Development Automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DI-BENCH
Large Language Models
Dependency Detection
Authors

Linghao Zhang
Microsoft

Junhao Wang
Tongji University

Shilin He
Microsoft Research
LLM, Software Engineering, NLP

Chaoyun Zhang
Microsoft
GUI Agent, LLM, Causal Inference, AIOps, Spatio-temporal Modelling

Yu Kang
Microsoft

Bowen Li
Shanghai AI Laboratory

Jiaheng Wen
Zhejiang University

Chengxing Xie
Shanghai AI Laboratory

Maoquan Wang
Microsoft

Yufan Huang
Microsoft

Elsie Nallipogu
Microsoft

Qingwei Lin
Microsoft

Yingnong Dang
Microsoft
Cloud service, data analytics, software analytics, machine learning, human-computer interaction

Saravan Rajmohan
Microsoft

Dongmei Zhang
Microsoft Research
Software Engineering, Machine Learning, Information Visualization

Qi Zhang
Microsoft