MASCOT: Analyzing Malware Evolution Through A Well-Curated Source Code Dataset

📅 2025-11-30
🤖 AI Summary
Problem: Rapid malware evolution and pervasive code reuse complicate lineage inference, hindering threat attribution and defense. Method: We construct a curated dataset of 6,032 manually verified malware source-code samples and, for the first time, systematically quantify engineering attributes (scale, development cost, code quality, security practices, and dependency structure) from a software engineering perspective. We propose a multi-view lineage analysis framework that jointly leverages code similarity, dependency graphs, and multidimensional engineering metrics to quantify association strength, reconstruct individual evolutionary trajectories, and generate interpretable lineage visualizations. Results: Empirical analysis shows that malware complexity and standardization keep increasing, while significant code-quality deficiencies remain. The approach uncovers familial derivation patterns, cross-family module reuse, and ecosystem-level evolutionary dynamics, offering an engineering-informed basis for malware threat intelligence, attribution, and proactive defense.

📝 Abstract
In recent years, the explosion of malware and extensive code reuse have formed complex evolutionary connections among malware specimens. The rapid pace of development makes it challenging for existing studies to characterize recent evolutionary trends. In addition, intuitive tools to untangle these intricate connections between malware specimens or categories are urgently needed. This paper introduces a manually reviewed malware source code dataset containing 6,032 specimens. Building on and extending current research from a software engineering perspective, we systematically evaluate the scale, development cost, code quality, security, and dependencies of modern malware. We further introduce a multi-view genealogy analysis to clarify malware connections: at the overall view, this analysis quantifies the strength and direction of connections among specimens and categories; at the detailed view, it traces the evolutionary histories of individual specimens. Experimental results indicate that, despite persistent shortcomings in code quality, malware specimens exhibit increasing complexity and standardization, in step with the development of mainstream software engineering practices. Meanwhile, our genealogy analysis intuitively reveals lineage expansion and evolution driven by code reuse, providing new evidence and tools for understanding the formation and evolution of the malware ecosystem.
Problem

Research questions and friction points this paper is trying to address.

Analyzing malware evolution through a curated source code dataset
Systematically evaluating malware scale, cost, quality, security, and dependencies
Introducing multi-view genealogy analysis to clarify malware connections and lineages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manually curated malware source code dataset
Multi-view genealogy analysis for malware connections
Evaluation of malware complexity, standardization, and evolution
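
This summary does not say how the genealogy analysis computes the strength and direction of a connection. As a rough illustration only, the sketch below assumes token-shingle Jaccard similarity for strength and a first-seen date for direction (older specimen to newer). The function names, the threshold, and the toy inputs are hypothetical and are not taken from MASCOT.

```python
import re
from datetime import date


def shingles(source: str, k: int = 3) -> set:
    """Split source text into tokens and return the set of k-token windows."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def lineage_edges(specimens: dict, threshold: float = 0.3) -> list:
    """Return (older, newer, strength) edges for sufficiently similar pairs.

    `specimens` maps a name to (first_seen_date, source_text). Both the
    date-based direction and the threshold are assumptions of this sketch.
    """
    names = sorted(specimens)
    sets = {n: shingles(specimens[n][1]) for n in names}
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            strength = jaccard(sets[a], sets[b])
            if strength >= threshold:
                # Direct the edge from the earlier specimen to the later one.
                older, newer = sorted((a, b), key=lambda n: specimens[n][0])
                edges.append((older, newer, round(strength, 3)))
    return edges


if __name__ == "__main__":
    # Harmless toy snippets standing in for source files.
    toy = {
        "alpha": (date(2018, 1, 1),
                  "int main() { connect(host); send(data); sleep(5); }"),
        "beta": (date(2015, 1, 1),
                 "int main() { connect(host); send(data); }"),
        "gamma": (date(2020, 1, 1), "print('hello')"),
    }
    for edge in lineage_edges(toy):
        print(edge)
```

Pairwise comparison is quadratic in the number of specimens. For a dataset of several thousand samples, an approximate scheme such as MinHash would likely be needed, which this sketch does not attempt.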
Bojing Li
Department of Computer Science and Electrical Engineering, Baltimore, Maryland, USA
Duo Zhong
Department of Computer Science and Electrical Engineering, Baltimore, Maryland, USA
Dharani Nadendla
Department of Computer Science and Electrical Engineering, Baltimore, Maryland, USA
Gabriel Terceros
Department of Computer Science and Electrical Engineering, Baltimore, Maryland, USA
Prajna Bhandar
Department of Computer Science and Electrical Engineering, Baltimore, Maryland, USA
Raguvir S
Department of Computer Science and Electrical Engineering, Baltimore, Maryland, USA
Charles Nicholas
UMBC
computer science