MLRan: A Behavioural Dataset for Ransomware Analysis and Detection

📅 2025-05-24
🤖 AI Summary
Existing public ransomware detection datasets suffer from limited scale, low diversity, narrow temporal coverage, and poor reproducibility. To address these limitations, this work proposes GUIDE-MLRan, a set of guidelines for constructing behavioural ransomware datasets, and releases MLRan, a behaviour-level dataset of over 4,800 samples spanning 2006–2024. MLRan covers 64 ransomware families and a balanced set of benign samples. Behavioural logs are collected via dynamic analysis, followed by two-stage feature selection: mutual information filtering reduces 6.4 million raw features to 24,162, and recursive feature elimination (RFE) narrows these to 483 highly informative ones. Key malicious indicators identified with SHAP and LIME include registry modifications, string patterns, and API misuse. Detection performance is evaluated across multiple models (Random Forest, XGBoost, SVM), reaching up to 98.7% accuracy, 98.9% precision, and 98.5% recall. The dataset and the code for feature extraction, selection, training, and evaluation are publicly available to support reproducibility.

📝 Abstract
Ransomware remains a critical threat to cybersecurity, yet publicly available datasets for training machine learning-based ransomware detection models are scarce and often limited in sample size, diversity, and reproducibility. In this paper, we introduce MLRan, a behavioural ransomware dataset comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We also propose guidelines (GUIDE-MLRan), inspired by previous work, for constructing high-quality behavioural ransomware datasets, which informed the curation of our dataset. We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan. For this purpose, we performed feature selection by conducting mutual information filtering to reduce the initial 6.4 million features to 24,162, followed by recursive feature elimination, yielding 483 highly informative features. The ML models achieved an accuracy, precision, and recall of up to 98.7%, 98.9%, and 98.5%, respectively. Using SHAP and LIME, we identified critical indicators of malicious behaviour, including registry tampering, strings, and API misuse. The dataset and source code for feature extraction, selection, ML training, and evaluation are publicly available at https://github.com/faithfulco/mlran to support replicability and encourage future research.
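The two-stage feature selection described in the abstract (a mutual information filter followed by recursive feature elimination) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code: the data is synthetic, the feature counts (60 to 20 to 5) are tiny stand-ins for the paper's 6.4 million to 24,162 to 483, and using a Random Forest as both the RFE estimator and the final classifier is an assumption made for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a behavioural feature matrix (NOT MLRan data).
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=8, n_redundant=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    # Stage 1: keep the k features with the highest mutual information with the label.
    ("mi_filter", SelectKBest(partial(mutual_info_classif, random_state=0), k=20)),
    # Stage 2: repeatedly drop the least important remaining feature.
    # The estimator and target size here are illustrative assumptions.
    ("rfe", RFE(RandomForestClassifier(n_estimators=50, random_state=0),
                n_features_to_select=5, step=1)),
    # Final classifier trained on the selected features only.
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

n_after_mi = int(pipeline.named_steps["mi_filter"].get_support().sum())
n_after_rfe = int(pipeline.named_steps["rfe"].support_.sum())
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
print(f"features: 60 -> {n_after_mi} -> {n_after_rfe}")
print({name: round(value, 3) for name, value in metrics.items()})
```

Wrapping both selection stages in a `Pipeline` means they are fitted on the training split only, so the held-out metrics are not inflated by selecting features with knowledge of the test labels.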
Problem

Research questions and friction points this paper is trying to address.

Scarce public datasets for ransomware detection models
Need diverse ransomware samples for accurate ML training
Identifying key behavioral features for ransomware detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

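Behavioural dataset of over 4,800 samples across 64 ransomware families, with balanced goodware
Two-stage feature selection: mutual information filtering, then recursive feature elimination
SHAP and LIME analysis to identify indicators of malicious behaviour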
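SHAP attributions are based on Shapley values, which split a model's output among its input features. The sketch below computes exact Shapley values for a toy scorer by enumerating every feature coalition. It illustrates the underlying definition only: it does not use the SHAP library, and the feature names, weights, and interaction term are hypothetical, loosely echoing the indicator types named in the abstract.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (only feasible for small n)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i when added to this coalition.
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


# Hypothetical behavioural indicators and weights, for illustration only.
FEATURES = ["registry_write", "ransom_note_string", "crypto_api_call"]
WEIGHTS = [0.5, 0.2, 0.3]


def toy_score(present):
    """Toy maliciousness score over the set of feature indices that are present."""
    score = sum(WEIGHTS[j] for j in present)
    if 0 in present and 2 in present:
        score += 0.2  # assumed interaction: registry write together with crypto API call
    return score


phi = shapley_values(toy_score, len(FEATURES))
for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:.3f}")
```

The interaction bonus is shared equally between the two features involved, and the attributions sum to the difference between the score with all features present and the score with none. Exact enumeration grows exponentially with the number of features, which is why practical tools approximate these values.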
F. C. Onwuegbuche
Adelodun Olaoluwa
A. Jurcut
Liliana Pasquale
University College Dublin