MalwarePT: A Binary-Level Foundation Model for Malware Analysis

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing malware analysis approaches often rely on handcrafted features or byte-level models, which struggle to capture multi-byte code patterns and tend to be task-specific. This work introduces MalwarePT, a general-purpose foundation model for the code sections of Windows PE binaries. It trains a Byte Pair Encoding (BPE) tokenizer on code-section bytes so that frequent multi-byte sequences are compressed into single tokens, which fits more code into a fixed context budget. Built on a ModernBERT-style encoder and pretrained with masked language modeling, the model transfers to tasks at several granularities, from individual tokens to entire programs. Pretraining yields substantial gains on API call prediction and functionality classification, and a BPE vocabulary of 1,024 tokens gives the strongest overall tradeoff. For malware detection at a false positive rate of approximately 0.001, it outperforms the neural network baselines and complements models that rely on engineered PE structural features.
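To make the tokenization idea concrete, the following is a minimal sketch of byte-level BPE, assuming a plain greedy merge procedure over raw bytes. It is not the paper's tokenizer: the function names (`train_byte_bpe`, `merge_pair`, `encode`) and the toy corpus are hypothetical, and a real setup would train on a large corpus of PE code sections with a vocabulary such as the 1,024 tokens reported in the abstract.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_byte_bpe(corpus, vocab_size):
    """Learn BPE merges over raw bytes.

    IDs 0..255 are the single bytes; each learned merge adds one new ID.
    Returns a list of ((left_id, right_id), new_id) in the order learned.
    """
    seqs = [list(blob) for blob in corpus]
    merges = []
    next_id = 256
    while next_id < vocab_size:
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        merges.append((best, next_id))
        seqs = [merge_pair(seq, best, next_id) for seq in seqs]
        next_id += 1
    return merges


def encode(data, merges):
    """Tokenize `data` (bytes) by replaying the merges in the order they were learned."""
    seq = list(data)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq


# Toy corpus (hypothetical): a repeated x86 prologue-like byte pattern.
corpus = [bytes.fromhex("558bec90") * 8]
merges = train_byte_bpe(corpus, vocab_size=258)
tokens = encode(corpus[0], merges)
print(len(corpus[0]), "bytes ->", len(tokens), "tokens")
```

Each merge shortens the token sequence, so a fixed context length covers more bytes of code. This is the compression effect the summary attributes to moving beyond a byte-level vocabulary.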

📝 Abstract

Automated malware analysis increasingly relies on machine learning, yet most existing methods remain task-specific and depend on handcrafted features or narrowly scoped models. Recent developments in binary-level foundation models suggest a path toward reusable program representations, but their application to malware analysis remains underexplored, and most still operate at byte-level tokenization, limiting their ability to capture multi-byte code patterns. In this work, we introduce MalwarePT, a binary-level foundation model for malware analysis built on a ModernBERT-style encoder and pretrained with masked language modeling on Windows PE code-section bytes. We study whether a single pretrained encoder can transfer across malware-analysis tasks at different granularities, and how tokenization design affects that transfer. We train a byte-pair encoding (BPE) tokenizer on code-section bytes to compress frequent multi-byte patterns within a fixed context budget. We evaluate MalwarePT on three downstream tasks spanning token-, function-, and document-level prediction: API call prediction, functionality classification, and malware (program) detection under temporal drift. Our evaluation demonstrates that pretraining yields substantial gains for API call prediction and functionality classification, and that increasing the BPE vocabulary beyond the byte-level baseline improves performance, with the strongest overall tradeoff at a vocabulary size of 1,024 tokens. In malware detection at FPR ~ 0.001, MalwarePT outperforms the neural network baselines, and is complementary to feature-engineering models that rely on PE structure. We also compare against existing binary foundation models and show that MalwarePT's design choices yield gains across all downstream tasks.
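The detection result above is reported at a fixed false positive rate rather than as plain accuracy. The sketch below shows one common way to compute a true positive rate at a target FPR, assuming per-sample scores where higher means more likely malicious and labels of 0 (benign) and 1 (malicious). The function name `tpr_at_fpr` and the toy scores are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def tpr_at_fpr(scores, labels, target_fpr):
    """Return (threshold, tpr) with the benign false positive rate held at or below target_fpr.

    A sample is flagged as malicious when its score is strictly above the threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign samples we may flag without exceeding the target FPR.
    allowed_fp = int(target_fpr * len(benign))

    # Using the (allowed_fp + 1)-th highest benign score as the threshold leaves
    # at most `allowed_fp` benign scores strictly above it, even with ties.
    threshold = benign[allowed_fp] if allowed_fp < len(benign) else float("-inf")

    tpr = sum(s > threshold for s in malicious) / len(malicious)
    return threshold, tpr


# Toy example (made-up scores): 4 benign and 4 malicious samples, target FPR of 0.25.
scores = [0.1, 0.2, 0.3, 0.9, 0.95, 0.5, 0.25, 0.8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, tpr = tpr_at_fpr(scores, labels, target_fpr=0.25)
print(threshold, tpr)
```

At a target FPR of 0.001, a test set needs at least 1,000 benign samples before even one false positive is allowed, which is why this operating point is usually evaluated on large benign collections.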

Problem

Research questions and friction points this paper is trying to address.

malware analysis

binary-level foundation model

tokenization

multi-byte code patterns

task-specific models

Innovation

Methods, ideas, or system contributions that make the work stand out.

binary-level foundation model

byte-pair encoding (BPE)

malware analysis