🤖 AI Summary
This study addresses three limitations in research on malicious open-source software (OSS) packages: low dataset quality, limited diversity of malicious behaviors, and missing attack-campaign context. It builds a dataset of 24,356 real-world malicious packages collected from online sources, which the authors describe as the largest to date, and introduces a knowledge graph of the malware corpus for in-the-wild analysis. Methodologically, it integrates multi-source crawling and validation, cross-source overlap analysis, dependency-chain tracing, and temporal activity modeling. Key findings: (1) malicious packages are scattered across sources with little overlap between them, so collecting from multiple sources is essential; (2) code reuse is pervasive, with more than 90% of packages reported as semantically indistinguishable, which means low malware diversity; (3) malware hidden in dependencies has a shorter active time; and (4) security reports are the only reliable source of attack context. The study further proposes a scalable framework for analyzing malicious OSS behavior and identifies 28 malicious packages that were repeatedly hidden in the dependencies of 1,354 malicious packages.
📝 Abstract
The open-source software (OSS) ecosystem suffers from security threats caused by malware. However, OSS malware research has three limitations: a lack of high-quality datasets, a lack of malware diversity, and a lack of attack campaign context. In this paper, we first build the largest dataset of 24,356 malicious packages from online sources, then propose a knowledge graph to represent the OSS malware corpus and conduct malware analysis in the wild. Our main findings are: (1) it is essential to collect malicious packages from various online sources because the overlap between their data is small; (2) despite the sheer volume of malicious packages, many reuse similar code, leading to low malware diversity; (3) only 28 malicious packages were repeatedly hidden via the dependency libraries of 1,354 malicious packages, and dependency-hidden malware has a shorter active time; (4) security reports are the only reliable source for disclosing the malware-based context.
Index Terms: Malicious Packages, Software Analysis
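Finding (1) rests on measuring how much the online sources overlap. The sketch below shows one way such a comparison could be done, using the Jaccard index over package identifiers. The abstract does not say which overlap measure the paper uses, so the metric, the source names, and the package tuples here are all illustrative assumptions rather than the authors' method or data.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(sources: dict) -> dict:
    """Return the Jaccard overlap for every unordered pair of sources."""
    return {
        (s1, s2): jaccard(sources[s1], sources[s2])
        for s1, s2 in combinations(sorted(sources), 2)
    }


# Hypothetical data: each package is keyed by (ecosystem, name, version).
sources = {
    "source_a": {("npm", "pkg-one", "1.0.0"), ("pypi", "pkg-two", "0.1.0")},
    "source_b": {("npm", "pkg-one", "1.0.0"), ("pypi", "pkg-three", "2.0.0")},
    "source_c": {("rubygems", "pkg-four", "0.0.1")},
}

overlap = pairwise_overlap(sources)
for pair, score in overlap.items():
    print(pair, round(score, 3))
```

Values near zero for most pairs would correspond to the situation the abstract describes, where no single source covers the corpus and packages have to be collected from several.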