🤖 AI Summary
Problem: Existing malware analysis methods lack large-scale graph datasets that capture software’s intrinsic hierarchical structure. Most approaches flatten programs into single-layer graphs and therefore fail to model the semantic relationship between instruction-level logic and function-level interactions.
Method: We introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising 595K function call graphs (FCGs) with over 200M control flow graphs (CFGs) nested inside them, so that semantic structure is preserved at both granularities. We further propose a dual-level graph representation framework compatible with both GNNs and structured program analysis.
Contribution/Results: The two-level representation preserves the structural semantics needed to build detectors that remain robust under code obfuscation and malware evolution. A large-scale analysis of HiGraph reveals distinct structural properties of benign and malicious software, and the publicly released dataset and tools establish an open hierarchical graph benchmark for the research community.
📝 Abstract
The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often oversimplify programs into single-level graphs, failing to model the crucial semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HiGraph's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at https://higraph.org.
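To make the two-level representation concrete, the following is a minimal sketch of an FCG whose function nodes each carry their own CFG. It is an illustration under stated assumptions, not HiGraph's actual schema or loader: the class names `ControlFlowGraph` and `FunctionCallGraph`, their fields, and the toy sample are all hypothetical, and the per-level statistics are only examples of the kind of structural properties one could compare across benign and malicious programs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ControlFlowGraph:
    """Low level (hypothetical layout): nodes are basic blocks,
    edges are control transfers between them."""
    num_blocks: int
    edges: list[tuple[int, int]] = field(default_factory=list)

    def cyclomatic_complexity(self) -> int:
        # McCabe's metric for a single connected CFG: E - N + 2.
        return len(self.edges) - self.num_blocks + 2


@dataclass
class FunctionCallGraph:
    """High level (hypothetical layout): nodes are functions, edges are calls.
    Each function node owns one CFG, which gives the two-level nesting."""
    cfgs: dict[str, ControlFlowGraph] = field(default_factory=dict)
    calls: list[tuple[str, str]] = field(default_factory=list)

    def summary(self) -> dict[str, float]:
        # Aggregate simple structural statistics across both levels.
        n = len(self.cfgs)
        total_cc = sum(c.cyclomatic_complexity() for c in self.cfgs.values())
        return {
            "functions": n,
            "call_edges": len(self.calls),
            "basic_blocks": sum(c.num_blocks for c in self.cfgs.values()),
            "mean_cyclomatic_complexity": total_cc / n if n else 0.0,
        }


# Toy program: main() contains one if/else diamond and calls helper().
sample = FunctionCallGraph(
    cfgs={
        "main": ControlFlowGraph(4, [(0, 1), (0, 2), (1, 3), (2, 3)]),
        "helper": ControlFlowGraph(1, []),
    },
    calls=[("main", "helper")],
)

if __name__ == "__main__":
    print(sample.summary())
```

Keeping each CFG attached to its function node, rather than merging everything into one flat graph, lets a model or analysis read call-level and instruction-level structure separately or jointly.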