Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora

📅 2025-11-17
🤖 AI Summary
Finding the most frequent 6-8 byte n-grams across terabytes of executables is costly enough to prevent regular retraining of byte n-gram malware detection models. This paper proposes Zipf-Gramming, a top-k n-gram extraction algorithm that exploits the Zipfian distribution of n-gram frequencies and runs up to 35× faster than the previous best alternative. The speedup lets the authors scale up their production training set and obtain up to a 30% improvement in AUC at detecting new malware. Byte n-gram classifiers are the only approach the authors found that meets their deployment requirements for size (under 2 MB), speed (multiple GB/s), and latency (under 10 ms). The paper argues, both theoretically and empirically, that Zipf-Gramming selects the top-k items with little error.

📝 Abstract
A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution well models the distribution of n-grams, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach will select the top-k items with little error and the interplay between theory and engineering required to achieve these results.
Problem

Research questions and friction points this paper is trying to address.

Scaling byte n-gram extraction for large malware corpora efficiently
Reducing computational cost for frequent n-gram discovery in terabytes
Improving malware detection accuracy through optimized feature selection
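
To make the cost in the points above concrete, here is a minimal sketch of the naive exact approach: count every byte n-gram in every file, then keep the k most frequent. It is an illustrative baseline, not the Zipf-Gramming algorithm and not necessarily the "previous best alternative" the paper compares against. The function name `top_k_ngrams` is hypothetical, and counting total occurrences (rather than, say, the number of files containing an n-gram) is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Exact top-k byte n-grams by total occurrence count (naive baseline)."""
    counts: Counter = Counter()
    for data in files:
        # Slide a window of n bytes over the file; files shorter than n
        # yield an empty range and contribute nothing.
        counts.update(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(k)


# Tiny in-memory stand-in for a corpus of executables (hypothetical data).
corpus = [b"ababab", b"abab"]
print(top_k_ngrams(corpus, n=2, k=1))
```

The counter in this sketch holds one entry per distinct n-gram seen. For 6-8 byte n-grams over terabytes of input, that table is dominated by n-grams that can never reach the top k, which is the overhead the paper sets out to avoid.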
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zipf-Gramming algorithm accelerates top-k n-gram extraction
Exploits the Zipfian distribution of n-grams for up to 35x faster top-k extraction than the previous best alternative
Makes regular model updates on a larger training set practical, with up to 30% AUC improvement on new malware
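
As a rough illustration of why a Zipfian distribution helps, the sketch below computes the share of all occurrences carried by the top-k ranks when the frequency of rank r is proportional to 1/r^s. It shows the general property only and is not taken from the paper: the function name `zipf_top_k_mass`, the exponent `s`, and the vocabulary size are all assumptions chosen for illustration.

```python
def zipf_top_k_mass(vocab_size: int, k: int, s: float = 1.0) -> float:
    """Fraction of total occurrences held by the k most frequent items
    under an idealized Zipf law where frequency(rank) ~ 1 / rank**s."""
    weights = [1.0 / rank ** s for rank in range(1, vocab_size + 1)]
    return sum(weights[:k]) / sum(weights)


# Hypothetical numbers: one million distinct n-grams, keep the top 1,000.
share = zipf_top_k_mass(vocab_size=1_000_000, k=1_000)
print(f"top 0.1% of ranks hold {share:.0%} of occurrences")
```

Under this idealized model, a small head of the ranking accounts for a large share of all occurrences, while the long tail consists of items whose counts are far below the top-k cutoff. That gap is what makes it plausible to identify the top-k without counting the tail exactly.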