Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG

📅 2025-07-05
🤖 AI Summary
With the rise of large language models (LLMs) and retrieval-augmented generation (RAG), the effectiveness of traditional binary string features for malware family classification (MFC) remains underexplored. Method: We propose Family-Specific Strings (FSS) and use them as a lightweight, RAG-like retrieval mechanism. We construct a benchmark comprising 4,347 samples across 67 families, systematically extract and analyze over 25 million strings, and design a multi-module framework integrating FSS extraction, retrieval-based matching, and LLM fusion. Contribution/Results: Ablation studies demonstrate that FSS significantly improves both classification accuracy and interpretability. Our work validates the feasibility and advantages of reusing conventional string features within LLM/RAG paradigms, offering a low-overhead, highly interpretable approach to fine-grained malware identification.

📝 Abstract
Malware Family Classification (MFC) aims to identify the fine-grained family (e.g., GuLoader or BitRAT) to which a potential malware sample belongs, in contrast to malware detection or sample classification, which predicts only a Yes/No answer. Accurate family identification can greatly facilitate automated labeling and understanding of samples on crowdsourced malware analysis platforms such as VirusTotal and MalwareBazaar, which generate vast amounts of data daily. In this paper, we explore and assess the feasibility of using traditional binary string features for MFC in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, we investigate how Family-Specific String (FSS) features could be utilized in a manner similar to RAG to facilitate MFC. To this end, we develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.
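The idea of using family-specific strings as a retrieval source can be illustrated with a minimal sketch. It is not the paper's pipeline: the function names, the "appears in exactly one family" rule, the overlap-count scoring, and the marker strings in the example are all assumptions made for illustration. String extraction follows the familiar convention of the `strings` utility (runs of printable ASCII of at least four characters).

```python
from __future__ import annotations

import re
from collections import defaultdict


def extract_strings(data: bytes, min_len: int = 4) -> set[str]:
    """Return runs of printable ASCII of at least min_len characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.decode("ascii") for m in pattern.findall(data)}


def family_specific_strings(corpus: dict[str, list[set[str]]]) -> dict[str, set[str]]:
    """Keep only strings observed in exactly one family (assumed FSS rule).

    corpus maps a family name to the per-sample string sets of that family.
    """
    families_seen: dict[str, set[str]] = defaultdict(set)
    for family, samples in corpus.items():
        for strings in samples:
            for s in strings:
                families_seen[s].add(family)

    fss: dict[str, set[str]] = {family: set() for family in corpus}
    for s, families in families_seen.items():
        if len(families) == 1:
            fss[next(iter(families))].add(s)
    return fss


def retrieve_family(sample_strings: set[str], fss: dict[str, set[str]]):
    """Pick the family whose FSS set overlaps most with the sample.

    Returns (family, matched_strings), or (None, set()) when nothing matches.
    The matched strings double as human-readable evidence, which could also
    be handed to an LLM as retrieved context.
    """
    hits = {family: sample_strings & strings for family, strings in fss.items()}
    best = max(hits, key=lambda family: len(hits[family]))
    return (best, hits[best]) if hits[best] else (None, set())


# Toy corpus: the byte strings below are invented, not real family indicators.
corpus = {
    "GuLoader": [extract_strings(b"\x00\x01kernel32.dll\x00gul_marker_v1\x00")],
    "BitRAT": [extract_strings(b"kernel32.dll\x00\xffbitrat_cfg\x00")],
}
fss_index = family_specific_strings(corpus)
unknown = extract_strings(b"\x90\x90bitrat_cfg\x00kernel32.dll")
predicted, evidence = retrieve_family(unknown, fss_index)
```

Strings shared across families, such as `kernel32.dll` in the toy corpus, are dropped from the index, so only discriminative strings contribute to the match. A real system would likely need softer criteria (for example frequency thresholds) than the strict single-family rule assumed here.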
Problem

Research questions and friction points this paper is trying to address.

Exploring string-based malware family classification using LLMs and RAG
Assessing feasibility of traditional binary string features in MFC
Investigating Family-Specific String features for improved malware identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes Family-Specific String (FSS) features
Uses FSS in a manner similar to Retrieval-Augmented Generation (RAG)
Evaluates on 4,347 samples from 67 malware families