DySec: A Machine Learning-based Dynamic Analysis for Detecting Malicious Packages in PyPI Ecosystem

📅 2025-03-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To counter sophisticated malicious-package attacks in the PyPI ecosystem, such as typosquatting and stealthy remote access activation, this paper proposes an eBPF-based framework that detects malicious behavior dynamically during package installation. It combines eBPF kernel probes with user-space instrumentation to extract 36 fine-grained behavioral features in real time during installation, covering system calls, network activity, file operations, and resource usage. The authors construct a large-scale, manually labeled dataset of PyPI dynamic behavior comprising 14,271 packages, which is presented as the first of its kind, and use XGBoost and LightGBM for low-latency classification (<0.5 s). The evaluation reports 95.99% accuracy, substantially fewer false positives, and false negatives reduced by 78.65% and 82.24% relative to static and metadata-based approaches, respectively. The method also identified six previously unknown malicious packages, four of which were subsequently removed from PyPI.
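As a rough illustration of how install-time events might be turned into classifier input, the sketch below counts events per behavior category and emits a fixed-order feature vector. It is a minimal sketch under stated assumptions, not the paper's pipeline: the `InstallEvent` record, the four category names, and the sample trace are hypothetical, and the actual 36 features are not enumerated in this summary.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical categories, loosely following the groups named in the summary.
# The real DySec feature set (36 features) is not listed here.
CATEGORIES = ("syscall", "network", "file", "resource")


@dataclass(frozen=True)
class InstallEvent:
    """One observed event during installation (illustrative structure only)."""
    category: str  # expected to be one of CATEGORIES
    detail: str    # e.g. a syscall name; free-form in this sketch


def to_feature_vector(events):
    """Count events per category and return the counts in CATEGORIES order."""
    counts = Counter(e.category for e in events if e.category in CATEGORIES)
    return [counts[c] for c in CATEGORIES]


# Made-up trace, for demonstration only.
sample_trace = [
    InstallEvent("syscall", "execve"),
    InstallEvent("network", "connect"),
    InstallEvent("file", "open"),
    InstallEvent("syscall", "openat"),
]
features = to_feature_vector(sample_trace)
print(features)  # [2, 1, 1, 0]
```

A vector of this shape could then be passed to a gradient-boosted classifier; the model choice and any training details beyond what the summary states are left out on purpose.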

📝 Abstract

Malicious Python packages make software supply chains vulnerable by exploiting trust in open-source repositories such as the Python Package Index (PyPI). Because they lack real-time behavioral monitoring, metadata inspection and static code analysis are inadequate against advanced attack strategies such as typosquatting, covert remote access activation, and dynamic payload generation. To address these challenges, we introduce DySec, a machine learning (ML)-based dynamic analysis framework for PyPI that uses eBPF kernel and user-level probes to monitor behaviors during package installation. By capturing 36 real-time features (including system calls, network traffic, resource usage, directory access, and installation patterns), DySec detects threats such as typosquatting, covert remote access activation, dynamic payload generation, and multiphase attack malware. We developed a comprehensive dataset of 14,271 Python packages, including 7,127 malicious sample traces, by executing them in a controlled, isolated environment. Experimental results demonstrate that DySec achieves 95.99% detection accuracy with a latency of <0.5 s, reducing false negatives by 78.65% compared to static analysis and by 82.24% compared to metadata analysis. During the evaluation, DySec flagged 11 packages that PyPI classified as benign. A manual analysis, including installation behavior inspection, confirmed six of them as malicious. These findings were reported to PyPI maintainers, resulting in the removal of four packages. By detecting malicious install-time behaviors, DySec bridges the gap between reactive traditional methods and proactive, scalable threat mitigation in open-source ecosystems.
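The false-negative figures above read most naturally as relative reductions against a baseline. The sketch below shows that arithmetic under this assumption; the abstract does not spell out the exact definition, and the rates used here are invented for illustration rather than taken from the paper.

```python
def relative_reduction(baseline_rate, new_rate):
    """Percentage by which new_rate is lower than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate * 100.0


# Hypothetical false negative rates, not values reported by the paper.
baseline_fnr = 0.20  # e.g. a static-analysis baseline
new_fnr = 0.05       # e.g. a dynamic-analysis detector

reduction = relative_reduction(baseline_fnr, new_fnr)
print(f"{reduction:.2f}% fewer false negatives")
```

Under this reading, a 78.65% reduction would mean the detector misses roughly one malicious package for every four to five that the baseline misses.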

Problem

Research questions and friction points this paper is trying to address.

Detects malicious Python packages in the PyPI ecosystem

Monitors real-time behaviors during package installation

Improves detection accuracy and reduces false negatives

Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine learning-based dynamic analysis framework

Uses eBPF for real-time behavior monitoring

Detects malicious install-time behaviors effectively

🔎 Similar Papers

PackageIntel: Leveraging Large Language Models for Automated Intelligence Extraction in Package Ecosystems

2024-09-23, arXiv.org, Citations: 2
