An Analysis of Malicious Packages in Open-Source Software in the Wild

📅 2024-04-07

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This study addresses three limitations of research on malicious packages in open-source software (OSS): low dataset quality, limited diversity of malicious behaviors, and missing attack-campaign context. It builds the largest dataset to date (24,356 real-world malicious packages) and introduces a knowledge graph of malicious packages for in-the-wild analysis. Methodologically, it combines multi-source crawling and validation, cross-source overlap analysis, dependency-chain tracing, and temporal activity modeling. Key findings: (1) malicious packages are scattered across online sources, with little overlap between sources; (2) code reuse is pervasive, leaving more than 90% of packages semantically indistinguishable and malware diversity low; (3) packages that hide malware behind dependencies have shorter active times; and (4) security reports are the only reliable source of attack context. The study further proposes a scalable framework for analyzing malicious behavior in OSS and identifies 28 malicious packages that were repeatedly hidden in the dependencies of 1,354 malicious packages.

📝 Abstract

The open-source software (OSS) ecosystem suffers from security threats caused by malware. However, OSS malware research has three limitations: a lack of high-quality datasets, a lack of malware diversity, and a lack of attack campaign contexts. In this paper, we first build the largest dataset of 24,356 malicious packages from online sources, then propose a knowledge graph to represent the OSS malware corpus and conduct malware analysis in the wild. Our main findings include (1) it is essential to collect malicious packages from various online sources because their data overlapping degrees are small; (2) despite the sheer volume of malicious packages, many reuse similar code, leading to a low diversity of malware; (3) only 28 malicious packages were repeatedly hidden via dependency libraries of 1,354 malicious packages, and dependency-hidden malware has a shorter active time; (4) security reports are the only reliable source for disclosing the malware-based context.

Index Terms: Malicious Packages, Software Analysis
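
Finding (1) rests on measuring how much the online sources overlap. The sketch below shows one way such a measurement could look. It is only an illustration: the abstract does not say which overlap metric the paper uses, so Jaccard similarity is an assumption here, and the source names, package tuples, and the helpers `jaccard` and `pairwise_overlap` are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(sources: dict) -> dict:
    """Map each unordered pair of source names to the overlap of their package sets."""
    return {
        (x, y): jaccard(sources[x], sources[y])
        for x, y in combinations(sorted(sources), 2)
    }


# Hypothetical data: each package is identified by (ecosystem, name, version).
sources = {
    "source_a": {("npm", "pkg-one", "1.0.0"), ("pypi", "pkg-two", "0.1")},
    "source_b": {("pypi", "pkg-two", "0.1"), ("pypi", "pkg-three", "2.0")},
    "source_c": {("npm", "pkg-four", "3.1.4")},
}

overlaps = pairwise_overlap(sources)
for pair, score in overlaps.items():
    print(pair, round(score, 3))
```

Values near zero for most pairs would match the reported finding that no single source covers the corpus, which is why the authors collect from several sources.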

Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality datasets for OSS malware research

Low malware diversity, because many malicious packages reuse similar code

Limited understanding of malware attack campaign contexts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest dataset of 24,356 malicious packages

Knowledge graph for OSS malware analysis

Evidence that dependency-hidden malware has a shorter active time
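
The dependency-hiding finding can be pictured as a reachability question over a dependency graph: which packages pull in a known malicious payload through their dependencies? The following is a minimal sketch under stated assumptions, not the paper's method. The dictionary layout, the package names, and the helpers `reachable` and `hidden_payload_usage` are hypothetical, and following transitive rather than only direct dependencies is also an assumption.

```python
from collections import defaultdict


def reachable(pkg, deps):
    """Return every package reachable from pkg via dependency edges (iterative DFS)."""
    seen, stack = set(), list(deps.get(pkg, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(deps.get(cur, ()))
    return seen


def hidden_payload_usage(deps, payloads):
    """Map each payload package to the set of packages that depend on it, directly or transitively."""
    usage = defaultdict(set)
    for pkg in deps:
        for hit in reachable(pkg, deps) & payloads:
            usage[hit].add(pkg)
    return dict(usage)


# Hypothetical dependency data: package name -> list of direct dependencies.
deps = {
    "front-a": ["helper"],
    "front-b": ["payload-x"],
    "helper": ["payload-x"],
    "payload-x": [],
}

usage = hidden_payload_usage(deps, {"payload-x"})
print(usage)
```

Counting the size of each set in `usage` gives the kind of tally the abstract reports, where a small number of payload packages is reused across many dependent packages.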

🔎 Similar Papers

PVAC: package version activity categorizer, leveraging semantic versioning in a heterogeneous system