🤖 AI Summary
This study addresses the limitations of current large language models in malware attribution, where outputs often include unsupported indicators and lack the code-level evidence needed to pinpoint malicious or vulnerable code. To bridge this gap, the authors propose LCCD, a code-centric benchmark dataset, and a seven-layer retrieval-augmented reasoning architecture that combines multiple code representations (decompiled C code, assembly, and CFG/FCG artifacts) with cybersecurity knowledge such as MITRE ATT&CK mappings and suspicious API usage. The framework employs Chain-of-Verification (CoVe) and a multi-dimensional quality gate to improve result reliability. Orchestrated via LangGraph and built on QLoRA-fine-tuned DeepSeek-R1-Distill-Qwen-14B and Qwen3-Coder-30B-A3B models, the approach achieves an average semantic similarity of 0.634 across 43 malware-analysis task types, with its strongest results in structured report generation and IoC extraction, and passes structured analysis on all ten real-world MalwareBazaar samples tested.
📝 Abstract
Large language models (LLMs) are increasingly explored for malware analysis; however, current LLM-based malware attribution remains limited by unsupported indicators and insufficient code-level grounding for identifying malicious and vulnerable code segments. To address these limitations, this research introduces LCC-LLM, which pairs a code-centric benchmark dataset (LCCD) with an evidence-grounded framework for malware attribution and multi-task static malware analysis. The LCCD dataset contains approximately 34K Portable Executable (PE) samples processed through a large-scale reverse-engineering pipeline and represented using decompiled C code, assembly code, control-flow graph and function-call graph (CFG/FCG) artifacts, hexadecimal data, PE metadata, suspicious API evidence, and structural features. Beyond dataset construction, LCC-LLM integrates LangGraph-orchestrated static analysis with multi-source cybersecurity knowledge to support evidence-grounded malware reasoning. The framework employs a seven-layer retrieval-augmented generation pipeline, Chain-of-Verification (CoVe) for indicator-of-compromise (IoC) validation, and a multi-dimensional quality gate to improve factual reliability and analyst-oriented decision support. Curriculum-ordered instruction data is used to fine-tune DeepSeek-R1-Distill-Qwen-14B and Qwen3-Coder-30B-A3B with QLoRA. Across 43 malware-analysis task types, the approach achieves an average semantic similarity of 0.634, with the highest task-level performance in structured report generation, IoC extraction, vulnerability assessment, malware configuration extraction, and malware class detection. In a real-world case study using MalwareBazaar samples, the grounded pipeline achieves a 10/10 structured analysis pass rate, producing CFG/FCG evidence, MITRE ATT&CK mappings, detection guidance, and analyst-ready reports. These results show that code-centric representations, retrieval grounding, and verification-guided reasoning improve the reliability and operational usefulness of LLM-assisted malware attribution.
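To make the idea of rejecting unsupported indicators concrete, the following is a minimal sketch of one possible grounding check, not the authors' implementation. The abstract does not specify how CoVe validation or the quality gate is computed, so the function name `grounding_gate`, the `GateResult` type, the verbatim substring-matching rule, and the 0.8 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    supported: list[str]    # IoCs found in at least one evidence artifact
    unsupported: list[str]  # IoCs with no backing evidence
    score: float            # fraction of IoCs that are supported
    passed: bool            # True when score meets the threshold


def grounding_gate(iocs, evidence, threshold=0.8):
    """Hypothetical gate (an assumption, not from the paper): an IoC counts
    as supported only if it appears verbatim, ignoring case, in at least one
    static-analysis artifact such as a disassembly or decompiled-code line."""
    haystack = [artifact.lower() for artifact in evidence]
    supported, unsupported = [], []
    for ioc in iocs:
        needle = ioc.lower()
        if any(needle in artifact for artifact in haystack):
            supported.append(ioc)
        else:
            unsupported.append(ioc)
    # With no IoCs claimed there is nothing unsupported, so the gate passes.
    score = len(supported) / len(iocs) if iocs else 1.0
    return GateResult(supported, unsupported, score, score >= threshold)


if __name__ == "__main__":
    # Illustrative evidence lines; the URL and IP address are made up.
    evidence = [
        "call ds:InternetOpenUrlA",
        'push offset aUrl ; "http://example.invalid/gate.php"',
    ]
    claimed = ["http://example.invalid/gate.php", "10.0.0.5"]
    print(grounding_gate(claimed, evidence))
```

In this sketch, a report that cites an IP address absent from every artifact would fail the gate and could be sent back for regeneration or flagged for analyst review. A real system would likely need normalization (defanged URLs, encoded strings) rather than plain substring matching.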