An Empirical Study of Vulnerabilities in Python Packages and Their Detection

📅 2025-09-04

🤖 AI Summary

Python package vulnerability detection lacks a high-precision, large-scale benchmark, which hinders rigorous tool evaluation. To address this, the paper proposes PyVul, the first comprehensive benchmark of Python package vulnerabilities, comprising 1,157 publicly reported, developer-verified real-world vulnerabilities annotated at both commit and function level. An LLM-assisted data cleansing step brings label accuracy to 100% at commit level and 94% at function level. A distribution analysis shows that vulnerabilities in Python packages span multiple programming languages and a wide variety of types, and that multi-lingual packages are potentially more susceptible to vulnerabilities. Evaluating state-of-the-art detectors on PyVul exposes a significant gap between their capabilities and what is needed to find real-world security issues in Python packages, and an empirical review of the top-ranked CWEs pinpoints fine-grained limitations of current tools.

📝 Abstract

In the rapidly evolving software development landscape, Python stands out for its simplicity, versatility, and extensive ecosystem. The security of Python packages, as units of organization, reusability, and distribution, has become a pressing concern, as highlighted by the considerable number of vulnerability reports. As a scripting language, Python often cooperates with other languages for performance or interoperability. This adds complexity to the vulnerabilities inherent to Python packages, and the effectiveness of current vulnerability detection tools remains underexplored. This paper addresses these gaps by introducing PyVul, the first comprehensive benchmark suite of Python-package vulnerabilities. PyVul includes 1,157 publicly reported, developer-verified vulnerabilities, each linked to its affected packages. To accommodate diverse detection techniques, it provides annotations at both commit and function levels. An LLM-assisted data cleansing method is incorporated to improve label accuracy, achieving 100% commit-level and 94% function-level accuracy and establishing PyVul as the most precise large-scale Python vulnerability benchmark. We further carry out a distribution analysis of PyVul, which demonstrates that vulnerabilities in Python packages involve multiple programming languages and exhibit a wide variety of types. Moreover, our analysis reveals that multi-lingual Python packages are potentially more susceptible to vulnerabilities. Evaluating state-of-the-art detectors on this benchmark reveals a significant discrepancy between the capabilities of existing tools and the demands of effectively identifying real-world security issues in Python packages. Additionally, we conduct an empirical review of the top-ranked CWEs observed in Python packages to diagnose the fine-grained limitations of current detection tools and highlight the necessity for future advancements in the field.
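
To make the two annotation granularities concrete, the following is a minimal sketch of how a benchmark entry with commit-level and function-level labels might be represented and how a detector's recall could be scored against it. The record layout, the field names, the helper functions, and the sample data are all hypothetical illustrations; they are not PyVul's actual schema or the paper's evaluation procedure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VulnEntry:
    """Hypothetical benchmark record (not PyVul's real schema)."""
    vuln_id: str                         # made-up identifier
    package: str                         # affected package name
    cwe: str                             # vulnerability type, e.g. "CWE-79"
    fix_commit: str                      # commit-level label
    vulnerable_functions: frozenset[str] # function-level labels


def function_level_recall(entries, flagged):
    """Share of labeled (commit, function) pairs that the detector flagged."""
    labeled = {(e.fix_commit, fn) for e in entries for fn in e.vulnerable_functions}
    if not labeled:
        return 0.0
    return len(labeled & flagged) / len(labeled)


def commit_level_recall(entries, flagged):
    """Share of labeled commits in which the detector flagged anything."""
    commits = {e.fix_commit for e in entries}
    if not commits:
        return 0.0
    flagged_commits = {commit for commit, _ in flagged}
    return len(commits & flagged_commits) / len(commits)


# Made-up sample data, for illustration only.
entries = [
    VulnEntry("EXAMPLE-1", "demo-a", "CWE-79", "c1", frozenset({"render", "escape"})),
    VulnEntry("EXAMPLE-2", "demo-b", "CWE-22", "c2", frozenset({"open_path"})),
]
# Detector output as (commit, function) pairs; "unrelated" is a false positive.
flagged = {("c1", "render"), ("c1", "unrelated")}

print(function_level_recall(entries, flagged))
print(commit_level_recall(entries, flagged))
```

Scoring at both levels matters because a tool can look adequate at commit level, where flagging anything in the affected commit counts, while still missing most of the individually labeled functions.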

Problem

Research questions and friction points this paper is trying to address.

Studying vulnerabilities in Python packages and detection methods

Evaluating effectiveness of current vulnerability detection tools

Analyzing multi-lingual nature and types of Python vulnerabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces PyVul benchmark suite for Python vulnerabilities

Uses LLM-assisted data cleansing for improved label accuracy

Provides commit and function level annotations for detection
