📝 Abstract
Large Language Models (LLMs) have recently emerged as powerful tools in cybersecurity, offering advanced capabilities in malware detection, malware generation, and real-time monitoring. Numerous studies have explored their application in cybersecurity, demonstrating their effectiveness in identifying novel malware variants, analyzing malicious code structures, and enhancing automated threat analysis. Several transformer-based architectures and LLM-driven models have been proposed to improve malware analysis by leveraging semantic and structural insights to recognize malicious intent more accurately. This study presents a comprehensive review of LLM-based approaches to malware code analysis, summarizing recent advancements, trends, and methodologies. We examine notable scholarly works to map the research landscape, identify key challenges, and highlight emerging innovations in LLM-driven cybersecurity. We also emphasize the role of static analysis in malware detection and introduce the notable datasets and specialized LLM models that support automated malware research. This study serves as a resource for researchers and cybersecurity professionals, offering insights into LLM-powered malware detection and defence strategies while outlining future directions for strengthening cybersecurity resilience.
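To make the static-analysis setting concrete, the sketch below shows the kind of pipeline such approaches build on: disassembled code is tokenized, and token statistics are scored against known-suspicious patterns. This is a deliberately minimal toy, not any model or dataset from the literature surveyed here; the function names and the `SUSPICIOUS_TOKENS` set are illustrative assumptions (real systems learn such patterns with a transformer rather than a hand-written list).

```python
# Toy sketch of a static malware-analysis pipeline: tokenize assembly,
# count tokens, and score against a hand-picked "suspicious" set.
# All names and the SUSPICIOUS_TOKENS list are illustrative assumptions.
from collections import Counter

SUSPICIOUS_TOKENS = {"xor", "int", "jmp", "call"}  # hypothetical indicators

def tokenize_asm(disassembly: str) -> list[str]:
    """Split disassembly into lowercase opcode/operand tokens."""
    tokens = []
    for line in disassembly.splitlines():
        line = line.split(";")[0]  # drop assembler comments
        tokens.extend(line.replace(",", " ").lower().split())
    return tokens

def suspicion_score(disassembly: str) -> float:
    """Fraction of tokens that fall in the (toy) suspicious-token set."""
    counts = Counter(tokenize_asm(disassembly))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    flagged = sum(n for t, n in counts.items() if t in SUSPICIOUS_TOKENS)
    return flagged / total

sample = """
    xor eax, eax      ; zero register
    call decrypt_payload
    jmp eax
"""
print(round(suspicion_score(sample), 2))
```

An LLM-based detector replaces the fixed token set and count-based score with learned embeddings and a classification head, which is what lets it generalize to previously unseen (zero-day) variants that share semantics but not exact byte patterns.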