Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy

📅 2025-06-19
🤖 AI Summary
To address the redundancy and weak discriminability of static features in fine-grained malware classification, this work adapts NLP-style n-gram modeling to API call sequences and to text extracted from executable files, producing sequential features that carry more semantic information. It combines a TF-IDF representation with a hybrid feature selection strategy (chi-square test plus mutual information) that retains only 1.6% of the original features, a 98.4% reduction in dimensionality, while preserving discriminative power. Evaluated on a real-world malware dataset, an ensemble of SVM, Random Forest, and XGBoost achieves 99.02% average classification accuracy, which the authors report as a significant improvement over conventional approaches. The core contribution is the transfer of n-gram modeling from NLP to static malware analysis, paired with a lightweight feature selection mechanism that addresses both the curse of dimensionality and the limited discriminability of traditional static features.
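
To make the TF-IDF step concrete, here is a minimal sketch in plain Python. It assumes the variant tf * log(N / df), which is only one of several common formulations; the summary does not say which one the paper uses. The function name `tf_idf` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by tf * log(N / df) (an assumed variant, not the paper's)."""
    n_docs = len(docs)
    # Document frequency: number of samples containing each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weighted.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


# Hypothetical samples, each already reduced to a list of n-gram strings.
samples = [
    ["open write", "write close", "open write"],
    ["open write", "reg set"],
]
weights = tf_idf(samples)
```

Under this variant, a term that appears in every sample ("open write" here) gets weight zero, while terms confined to one sample keep a positive weight. That is the property that makes TF-IDF useful for separating families.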

📝 Abstract
This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach captures distinctive linguistic patterns that separate malware families and benign samples, enabling finer-grained classification. We examine n-gram size selection, feature representation, and classification algorithms. Evaluated on real-world malware samples, the proposed method shows significantly improved accuracy compared with traditional methods. Our n-gram approach achieves an accuracy of 99.02% across various machine learning algorithms, using a hybrid feature selection technique to address high dimensionality. This technique reduces the feature set to only 1.6% of the original features.
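
As an illustration of what an n-gram over an API call sequence looks like, the following sketch extracts contiguous runs of n tokens and counts them. It is not the authors' implementation; the helper names `ngrams` and `ngram_counts` and the example trace are invented for this example.

```python
from collections import Counter
from typing import Iterable, Sequence


def ngrams(tokens: Sequence[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens: Sequence[str], sizes: Iterable[int]) -> Counter:
    """Count n-grams of several sizes in one token sequence."""
    counts: Counter = Counter()
    for n in sizes:
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical API call trace, invented for illustration only.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
bigrams = ngrams(trace, 2)
counts = ngram_counts(trace, sizes=(1, 2))
```

Larger values of n capture longer behavioral patterns but multiply the number of distinct features, which is why the abstract treats n-gram size selection and dimensionality reduction as linked concerns.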
Problem

Research questions and friction points this paper is trying to address.

Enhancing malware classification using NLP and machine learning
Analyzing malware textual features via n-gram sequences
Improving accuracy with hybrid feature selection techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

NLP-based n-gram analysis for malware classification
Hybrid feature selection to reduce dimensionality
Machine learning algorithms achieving 99.02% accuracy
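
The hybrid feature selection idea can be sketched as follows. The summary names the chi-square test and mutual information but does not say how the two scores are combined, so this example assumes one plausible rule: keep a feature only if it ranks in the top k under both scores. Features are simplified to present/absent flags, and all names and data are hypothetical.

```python
import math
from collections import Counter


def chi_square(present: list[bool], labels: list[str]) -> float:
    """Pearson chi-square statistic of the feature-by-class contingency table."""
    n = len(labels)
    joint = Counter(zip(present, labels))
    row, col = Counter(present), Counter(labels)
    score = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            score += (joint[(x, y)] - expected) ** 2 / expected
    return score


def mutual_information(present: list[bool], labels: list[str]) -> float:
    """Mutual information (in nats) between feature presence and class label."""
    n = len(labels)
    joint = Counter(zip(present, labels))
    row, col = Counter(present), Counter(labels)
    return sum(
        (count / n) * math.log(count * n / (row[x] * col[y]))
        for (x, y), count in joint.items()
    )


def hybrid_select(features: dict[str, list[bool]], labels: list[str], k: int) -> set[str]:
    """Keep features in the top k by BOTH scores (an assumed combination rule)."""
    def top_k(score) -> set[str]:
        ranked = sorted(features, key=lambda name: score(features[name], labels), reverse=True)
        return set(ranked[:k])
    return top_k(chi_square) & top_k(mutual_information)


# Hypothetical data: six samples, two families, three n-gram features.
labels = ["trojan", "trojan", "trojan", "worm", "worm", "worm"]
features = {
    "CreateFileW WriteFile": [True, True, True, False, False, False],  # separates the classes
    "CloseHandle": [True, True, True, True, True, True],               # present everywhere
    "RegSetValueExW": [True, False, True, False, True, False],         # weakly related
}
selected = hybrid_select(features, labels, k=1)
```

Intersecting the two rankings is a conservative choice that discards features favored by only one criterion. A union or a weighted combination would be equally reasonable readings of "hybrid".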