ML-Based Behavioral Malware Detection Is Far From a Solved Problem

📅 2024-05-09
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing behavioral malware detection research relies heavily on sandbox-derived features for both training and evaluation, but detectors deployed on real endpoints must operate on endpoint traces, where performance degrades sharply: roughly 20% (sandbox-trained) to 50% (endpoint-trained) accuracy, far below the >90% reported under sandbox-based evaluation. This first large-scale measurement study on real endpoints characterizes three underlying ML challenges behind the gap: label noise, distribution shift, and spurious features. Several mitigation techniques, including noise-robust learning and training directly on real-world endpoint telemetry, yield 5%–30% relative improvements over sandbox-trained baselines. The evidence argues for shifting the detection paradigm from sandbox-centric training to training directly on endpoint data, and the authors plan to let researchers run realistic detector evaluations against their real-world endpoint dataset.
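One of the challenges named above, label noise, is commonly attacked with noise-robust loss functions. As an illustrative sketch only (the paper does not specify this particular loss), generalized cross-entropy interpolates between standard cross-entropy and the noise-tolerant mean absolute error:

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    """Generalized cross-entropy: mean of (1 - p_y^q) / q.

    As q -> 0 this approaches standard cross-entropy; at q = 1 it equals
    MAE (up to scaling), which is robust to symmetric label noise.
    probs: (n, k) predicted class probabilities; labels: (n,) class indices.
    """
    p_y = probs[np.arange(len(labels)), labels]  # probability assigned to the labeled class
    return float(np.mean((1.0 - p_y ** q) / q))
```

Unlike cross-entropy, the per-example loss is bounded by 1/q, so a handful of confidently mislabeled samples (e.g., noisy sandbox verdicts) cannot dominate the gradient during training.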

📝 Abstract
Malware detection is a ubiquitous application of Machine Learning (ML) in security. In behavioral malware analysis, the detector relies on features extracted from program execution traces. The research literature has focused on detectors trained with features collected from sandbox environments and evaluated on samples also analyzed in a sandbox. However, in deployment, a malware detector at endpoint hosts often must rely on traces captured from endpoint hosts, not from a sandbox. Thus, there is a gap between the literature and real-world needs. We present the first measurement study of the performance of ML-based malware detectors at real-world endpoints. Leveraging a dataset of sandbox traces and a dataset of in-the-wild program traces, we evaluate two scenarios: (i) an endpoint detector trained on sandbox traces (convenient and easy to train), and (ii) an endpoint detector trained on endpoint traces (more challenging to train, since we need to collect telemetry data). We discover a wide gap between the performance as measured using prior evaluation methods in the literature -- over 90% -- vs. expected performance in endpoint detection -- about 20% (scenario (i)) to 50% (scenario (ii)). We characterize the ML challenges that arise in this domain and contribute to this gap, including label noise, distribution shift, and spurious features. Moreover, we show several techniques that achieve 5--30% relative performance improvements over the baselines. Our evidence suggests that applying detectors trained on sandbox data to endpoint detection is challenging. The most promising direction is training detectors directly on endpoint data, which marks a departure from current practice. To promote progress, we will facilitate researchers to perform realistic detector evaluations against our real-world dataset.
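The distribution shift the abstract describes (features from sandbox traces vs. endpoint traces) is often mitigated with feature-distribution alignment. A minimal sketch of one standard technique, CORAL (correlation alignment), is shown below as an illustration; it is not claimed to be the paper's implementation, and the feature matrices are hypothetical:

```python
import numpy as np

def coral_align(source, target, eps=1e-6):
    """Align source features to the target distribution (CORAL):
    whiten the source with its own covariance, then re-color with the
    target covariance, matching second-order statistics."""
    src = source - source.mean(axis=0)          # center both domains
    cs = np.cov(src, rowvar=False) + eps * np.eye(src.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])

    def _sqrtm(m):                              # symmetric PSD square root
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

    def _inv_sqrtm(m):                          # symmetric PSD inverse square root
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(1.0 / np.sqrt(np.clip(vals, eps, None))) @ vecs.T

    # whiten with source covariance, re-color with target covariance
    return src @ _inv_sqrtm(cs) @ _sqrtm(ct) + target.mean(axis=0)
```

In a sandbox-to-endpoint setting, `source` would be sandbox-derived feature vectors and `target` (unlabeled) endpoint feature vectors; a classifier is then trained on the aligned source features so its decision boundary better matches endpoint statistics.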
Problem

Research questions and friction points this paper is trying to address.

ML-based malware detection faces a gap between sandbox and real-world endpoint performance.
Challenges include label noise, distribution shift, and spurious features in endpoint detection.
Whether training detectors directly on endpoint data, despite the difficulty of collecting endpoint telemetry, can close this gap.
Innovation

Methods, ideas, or system contributions that make the work stand out.

First measurement study of ML-based malware detection on real-world endpoint traces
Quantification of the gap between sandbox-based evaluation (>90%) and endpoint performance (about 20%–50%)
Mitigation techniques yielding 5%–30% relative performance improvements over the baselines