AI Summary
This work addresses the pervasive issue of frequency and difficulty imbalances in real-world vulnerability detection data, which distort the geometry of embedding spaces. It proposes a unified framework that models both types of imbalance within a hyperspherical embedding geometry, introducing a dynamic geometric regularization mechanism based on the concentration parameter of the von Mises-Fisher distribution. By integrating adaptive margin metric learning with hyperspherical prototype modeling, the method aligns the probability mass of the embedding distribution with its corresponding Voronoi cells, thereby mitigating representation distortion and stabilizing decision boundaries. Experimental results demonstrate that the approach significantly outperforms strong baselines across multiple public vulnerability datasets, particularly excelling under severe imbalance conditions, and yields embeddings with enhanced discriminability, interpretability, and generalization capability.
Abstract
Software vulnerability detection is critical for ensuring software security and reliability. Despite recent advances in deep learning, real-world vulnerability datasets suffer from two severe challenges: frequency imbalance and difficulty imbalance. We reinterpret these challenges from an embedding geometry perspective, observing that such imbalances induce geometric distortions in hyperspherical representation space. To address this issue, we propose MARGIN, a metric-based framework that learns discriminative vulnerability representations through adaptive margin metric learning and hyperspherical prototype modeling. MARGIN dynamically adjusts geometric regularization according to the distribution structure estimated by the von Mises-Fisher concentration, aligning the probability mass of embedding distributions with their corresponding Voronoi cells, thereby reducing geometric distortion and yielding more stable decision boundaries. Extensive experiments on public vulnerability datasets show that MARGIN consistently outperforms strong baselines, achieving notable improvements in classification and detection, especially on challenging, imbalanced datasets. Further analysis demonstrates that MARGIN produces more structured embedding geometries, improving robustness, interpretability, and generalization.
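To make the role of the von Mises-Fisher concentration more concrete, the following is a minimal sketch and not MARGIN's actual implementation, which the abstract does not specify. It estimates a per-class concentration from unit-normalized embeddings using the standard closed-form approximation of Banerjee et al., kappa ≈ r(d − r²)/(1 − r²), where r is the norm of the mean direction and d is the embedding dimension. The function names and the margin rule in `adaptive_margin` (including `base_margin` and `scale`) are hypothetical and serve only to illustrate how a concentration estimate could drive an adaptive margin.

```python
import numpy as np


def vmf_concentration(embeddings: np.ndarray) -> float:
    """Approximate the vMF concentration (kappa) of one class.

    Standard approximation: kappa ~= r * (d - r^2) / (1 - r^2),
    with r the norm of the mean of the unit-normalized embeddings.
    """
    # Project embeddings onto the unit hypersphere.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dim = unit.shape[1]
    # Mean resultant length: close to 1 for tight classes, near 0 for diffuse ones.
    r = float(np.linalg.norm(unit.mean(axis=0)))
    r = min(r, 1.0 - 1e-6)  # guard against division by zero
    return r * (dim - r**2) / (1.0 - r**2)


def adaptive_margin(kappa: float, base_margin: float = 0.3, scale: float = 50.0) -> float:
    """Hypothetical rule (an assumption, not from the paper):
    diffuse classes (small kappa) receive a larger margin."""
    return base_margin * (1.0 + scale / (scale + kappa))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    center = np.zeros(16)
    center[0] = 1.0
    tight = center + 0.05 * rng.normal(size=(200, 16))   # well-clustered class
    diffuse = center + 1.0 * rng.normal(size=(200, 16))  # scattered class
    for name, emb in [("tight", tight), ("diffuse", diffuse)]:
        k = vmf_concentration(emb)
        print(name, "kappa:", round(k, 2), "margin:", round(adaptive_margin(k), 3))
```

Under these assumptions, a tightly clustered class yields a large kappa and a margin close to `base_margin`, while a scattered class yields a small kappa and a larger margin.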