🤖 AI Summary
Large language models (LLMs) exhibit structural deficiencies in dependency management when generating production-ready Python code, particularly in recommending installable and executable third-party libraries. Method: This study conducts the first systematic evaluation of six mainstream LLMs on real-world Stack Overflow Python questions, using a standardized benchmark that integrates prompt engineering, automated dependency parsing, and license analysis to quantify the installability, naming consistency (i.e., alignment between package names and import identifiers), and deployment feasibility of recommended libraries. Contribution/Results: LLMs strongly favor third-party libraries, yet 4.6% of recommendations fail to install because of package–import name mismatches; only two of the six models provide installation commands; and while most generated code is syntactically correct, it frequently lacks executable dependency support. These findings expose critical gaps in how LLMs handle software dependencies for production use, and the authors propose three concrete improvements to make library recommendations more usable: (1) enforcing naming consistency, (2) integrating dependency resolution into generation, and (3) augmenting prompts with installation-context awareness.
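The "automated dependency parsing" step can be sketched with Python's standard library alone. The helper below is our own illustration, not the paper's actual tooling: it walks the AST of an LLM-generated snippet and collects the top-level module names it imports, which is the raw input any installability check would start from.

```python
import ast


def extract_top_level_imports(code: str) -> set[str]:
    """Collect top-level module names imported by a Python snippet."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to the
            # snippet's own package, not an installable dependency.
            if node.module and node.level == 0:
                names.add(node.module.split(".")[0])
    return names


snippet = "import numpy as np\nfrom requests import get\nimport os.path"
extract_top_level_imports(snippet)  # {'numpy', 'requests', 'os'}
```

Note that this yields *import* names, not *package* names; the two can diverge, which is exactly the mismatch the study quantifies.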
📝 Abstract
Software libraries are central to the functionality, security, and maintainability of modern code. As developers increasingly turn to Large Language Models (LLMs) to assist with programming tasks, understanding how these models recommend libraries is essential. In this paper, we conduct an empirical study of six state-of-the-art LLMs, both proprietary and open-source, by prompting them to solve real-world Python problems sourced from Stack Overflow. We analyze the types of libraries they import, the characteristics of those libraries, and the extent to which the recommendations are usable out of the box. Our results show that LLMs predominantly favor third-party libraries over standard ones, and often recommend mature, popular, and permissively licensed dependencies. However, we also identify gaps in usability: 4.6% of the libraries could not be resolved automatically due to structural mismatches between import names and installable packages, and only two models (out of six) provided installation guidance. While the generated code is technically valid, the lack of contextual support places the burden of manually resolving dependencies on the user. Our findings offer actionable insights for both developers and researchers, and highlight opportunities to improve the reliability and usability of LLM-generated code in the context of software dependencies.