🤖 AI Summary
This work addresses a key limitation of traditional malware classification approaches: their inability to model the hierarchical semantic relationships between audio and visual modalities. To this end, it introduces hyperbolic geometry into this domain for the first time and proposes a multimodal fusion framework based on Poincaré ball embeddings. The method employs a hyperbolic cross-attention mechanism and a cross-modal fusion strategy grounded in Möbius addition to achieve hierarchy-aware alignment and joint learning of audio and visual representations derived from binary files. Evaluated on the MalNet and CICMalDroid2020 datasets, the proposed approach substantially outperforms both unimodal baselines and Euclidean multimodal methods, achieving state-of-the-art performance.
📝 Abstract
In this work, we introduce FOCA, a novel multimodal framework for malware classification that jointly leverages audio and visual modalities. Unlike conventional Euclidean-based fusion methods, FOCA is the first to exploit the intrinsic hierarchical relationships between audio and visual representations within hyperbolic space. To achieve this, raw binaries are transformed into both audio and visual representations, which are then processed through three key components: (i) a hyperbolic projection module that maps Euclidean embeddings into the Poincaré ball, (ii) a hyperbolic cross-attention mechanism that aligns multimodal dependencies under curvature-aware constraints, and (iii) a Möbius addition-based fusion layer. Comprehensive experiments on two benchmark datasets, MalNet and CICMalDroid2020, show that FOCA consistently outperforms unimodal models, surpasses Euclidean multimodal baselines, and achieves state-of-the-art performance.
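The two core hyperbolic operations the abstract names, projection into the Poincaré ball and Möbius addition-based fusion, can be sketched with the standard Poincaré-ball formulas. This is a minimal NumPy illustration under an assumed curvature parameter `c`; function names and the exact formulation FOCA uses are assumptions, not the paper's implementation.

```python
import numpy as np

def exp_map_zero(v, c=1.0):
    """Exponential map at the origin: projects a Euclidean (tangent-space)
    embedding v into the Poincare ball of curvature -c, so its norm stays
    below 1/sqrt(c)."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v)
    if norm < 1e-10:  # near the origin the map is (approximately) the identity
        return v
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x, y, c=1.0):
    """Mobius addition: the curvature-aware analogue of vector addition
    used to fuse two points already lying in the Poincare ball."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

# Hypothetical fusion of an audio and a visual embedding:
audio_emb = exp_map_zero(np.array([0.6, -0.8, 0.2]))
visual_emb = exp_map_zero(np.array([-0.1, 0.5, 0.3]))
fused = mobius_add(audio_emb, visual_emb)
```

A convenient sanity check of these formulas: `mobius_add(x, -x)` returns the origin and `mobius_add(x, 0)` returns `x`, mirroring ordinary vector addition, while all projected points remain strictly inside the unit ball.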