🤖 AI Summary
Existing packer detection methods rely on handcrafted heuristic features, such as high entropy or specific byte patterns. This makes them ineffective against low-entropy or adversarial packers and limits their generalizability and robustness. This paper proposes Pack-ALM, the first approach to adapt the linguistic distinction between “real words” and “non-words” to binary packer detection, framing it as an assembly instruction legitimacy classification task. Pack-ALM employs an end-to-end pretrained assembly language model to learn instruction-level semantic anomalies without manual feature engineering. Its pipeline comprises structured assembly preprocessing, pseudo-instruction generation, and contrastive pretraining. Evaluated on over 37,000 samples, Pack-ALM achieves higher detection accuracy and stronger adversarial robustness than conventional entropy-based methods and state-of-the-art assembly language models. It also adapts to previously unseen and low-entropy packers.
📝 Abstract
Detecting packed executables is a critical component of large-scale malware analysis and antivirus engine workflows, as it identifies samples that warrant computationally intensive dynamic unpacking to reveal concealed malicious behavior. Traditionally, packer detection techniques have relied on empirical features, such as high entropy or specific binary patterns. However, these feature-based methods are increasingly vulnerable to evasion by adversarial samples or unknown packers (e.g., low-entropy packers). Furthermore, their dependence on expert-crafted features makes them difficult to maintain and evolve over time.
In this paper, we examine the limitations of existing packer detection methods and propose Pack-ALM, a novel deep-learning-based approach for detecting packed executables. Inspired by the linguistic concept of distinguishing between real and pseudo words, we reformulate packer detection as a task of differentiating between legitimate and "pseudo" instructions. To achieve this, we preprocess native data and packed data into "pseudo" instructions and design a pre-trained assembly language model that recognizes features indicative of packed data. We evaluate Pack-ALM against leading industrial packer detection tools and state-of-the-art assembly language models. Extensive experiments on over 37,000 samples demonstrate that Pack-ALM effectively identifies packed binaries, including samples created with adversarial or previously unseen packing techniques. Moreover, Pack-ALM outperforms traditional entropy-based methods and advanced assembly language models in both detection accuracy and adversarial robustness.
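For context, the traditional entropy heuristic that the abstract contrasts Pack-ALM against can be sketched in a few lines. This is an illustrative sketch only and is not part of Pack-ALM: the function names `shannon_entropy` and `looks_packed` and the cutoff `ENTROPY_THRESHOLD = 7.0` are assumptions chosen for the example, not values taken from the paper or from any specific tool.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical cutoff for illustration; real tools tune this empirically.
ENTROPY_THRESHOLD = 7.0


def looks_packed(section: bytes, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag a byte region as 'possibly packed' when its entropy is high."""
    return shannon_entropy(section) >= threshold


# Repetitive, text-like bytes have low entropy and are not flagged.
plain = b"push ebp; mov ebp, esp; sub esp, 16; " * 200
# A uniform spread over all 256 byte values reaches the 8-bit maximum,
# similar to what compressed or encrypted data tends to look like.
uniform = bytes(range(256)) * 32

print(round(shannon_entropy(plain), 2), looks_packed(plain))
print(round(shannon_entropy(uniform), 2), looks_packed(uniform))
```

A single threshold like this is exactly what a low-entropy packer is designed to slip under, which is the weakness the abstract describes.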