LMDG: Advancing Lateral Movement Detection Through High-Fidelity Dataset Generation

📅 2025-08-04
🤖 AI Summary
Lateral movement (LM) attack detection in enterprise environments has long been hindered by the lack of realistic, fine-grained labeled datasets. To address this, we propose LMDG, a framework that uses process-tree modeling to trace each malicious event back to its origin and label it with the corresponding attack step and MITRE ATT&CK TTPs, giving reproducible, fine-grained annotation of both system and network logs for multi-stage LM attacks. LMDG integrates automated benign behavior simulation, orchestrated multi-stage attack execution, process-tree analysis, and system-level logging. Deployed in a 25-VM virtualized enterprise environment, it generates a 25-day, 944 GB dataset containing 35 multi-stage LM attacks, with malicious events making up less than 1% of total activity. Compared with existing datasets, LMDG-generated data offers more diverse and up-to-date LM attacks, longer attack timeframes, more comprehensive data sources, more realistic network architectures, and more accurate labeling.

📝 Abstract
Lateral Movement (LM) attacks continue to pose a significant threat to enterprise security, enabling adversaries to stealthily compromise critical assets. However, the development and evaluation of LM detection systems are impeded by the absence of realistic, well-labeled datasets. To address this gap, we propose LMDG, a reproducible and extensible framework for generating high-fidelity LM datasets. LMDG automates benign activity generation, multi-stage attack execution, and comprehensive labeling of system and network logs, dramatically reducing manual effort and enabling scalable dataset creation. A central contribution of LMDG is Process Tree Labeling, a novel agent-based technique that traces all malicious activity back to its origin with high precision. Unlike prior methods such as Injection Timing or Behavioral Profiling, Process Tree Labeling enables accurate, step-wise labeling of malicious log entries, correlating each with a specific attack step and MITRE ATT&CK TTPs. To our knowledge, this is the first approach to support fine-grained labeling of multi-step attacks, providing critical context for detection models such as attack path reconstruction. We used LMDG to generate a 25-day dataset within a 25-VM enterprise environment containing 22 user accounts. The dataset includes 944 GB of host and network logs and embeds 35 multi-stage LM attacks, with malicious events comprising less than 1% of total activity, reflecting a realistic benign-to-malicious ratio for evaluating detection systems. LMDG-generated datasets improve upon existing ones by offering diverse LM attacks, up-to-date attack patterns, longer attack timeframes, comprehensive data sources, realistic network architectures, and more accurate labeling.
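
The abstract does not spell out how Process Tree Labeling is implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes process-creation events carry a host name, a process ID, and a parent process ID, and that the attack orchestrator knows which processes it launched for each attack step (the "seeds"). Every descendant of a seed inherits that seed's step label. The names `ProcessEvent`, `label_process_tree`, and the step strings are hypothetical, and PID reuse over time is ignored for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessEvent:
    """One process-creation record (hypothetical schema)."""
    host: str
    pid: int
    ppid: int
    image: str


def label_process_tree(events, seeds):
    """Propagate attack-step labels from seed processes to all descendants.

    events: iterable of ProcessEvent
    seeds:  dict mapping (host, pid) -> step label for processes known to
            have been started by the attack tooling (an assumption here)
    Returns a dict mapping (host, pid) -> step label for malicious processes.
    """
    # Index children by (host, parent pid) so the tree can be walked downward.
    children = defaultdict(list)
    for ev in events:
        children[(ev.host, ev.ppid)].append((ev.host, ev.pid))

    labels = {}
    stack = list(seeds.items())
    while stack:
        key, step = stack.pop()
        if key in labels:
            continue
        labels[key] = step
        for child in children.get(key, ()):
            # A child that is itself a seed keeps its own step label;
            # otherwise it inherits the label of its parent.
            stack.append((child, seeds.get(child, step)))
    return labels


if __name__ == "__main__":
    events = [
        ProcessEvent("ws1", 100, 1, "explorer.exe"),
        ProcessEvent("ws1", 200, 100, "agent.exe"),
        ProcessEvent("ws1", 201, 200, "cmd.exe"),
        ProcessEvent("ws1", 300, 100, "chrome.exe"),
    ]
    print(label_process_tree(events, {("ws1", 200): "step-1"}))
```

Log entries whose `(host, pid)` appears in the returned mapping would be tagged with that step, and everything else would be treated as benign. A real pipeline would also need to handle PID reuse and activity that crosses hosts, which this sketch leaves out.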

Problem

Research questions and friction points this paper is trying to address.

Lack of realistic labeled datasets for LM attack detection
Difficulty in accurate step-wise labeling of malicious logs
Need for scalable high-fidelity LM dataset generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates benign activity and multi-stage attack generation
Introduces Process Tree Labeling for precise malicious tracing
Generates high-fidelity datasets with realistic attack ratios