Learning Generalizable Multimodal Representations for Software Vulnerability Detection

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vulnerability detection methods rely predominantly on a single modality, the source code itself, and overlook the developer intent captured in code comments, which limits their generalization across complex code structures and logical relationships. The authors propose MultiVul, a multimodal contrastive framework that treats source code and natural-language comments as complementary modalities. MultiVul aligns code and comment representations through dual similarity learning and consistency regularization, and augments training with diverse code-text pairs to improve robustness. The framework is used to fine-tune four large language models: DeepSeek-Coder-6.7B, Qwen2.5-Coder-7B, StarCoder2-7B, and CodeLlama-7B. On the DiverseVul and Devign benchmarks, MultiVul improves F1 by up to 27.07% over prompting-based methods and 13.37% over code-only fine-tuning, while maintaining comparable inference efficiency.

📝 Abstract

Source code and its accompanying comments are complementary yet naturally aligned modalities: code encodes structural logic while comments capture developer intent. However, existing vulnerability detection methods mostly rely on single-modality code representations, overlooking the complementary semantic information embedded in comments and thus limiting their generalization across complex code structures and logical relationships. To address this, we propose MultiVul, a multimodal contrastive framework that aligns code and comment representations through dual similarity learning and consistency regularization, augmented with diverse code-text pairs to improve robustness. Experiments on the widely adopted DiverseVul and Devign datasets across four large language models (LLMs), namely DeepSeek-Coder-6.7B, Qwen2.5-Coder-7B, StarCoder2-7B, and CodeLlama-7B, show that MultiVul achieves up to 27.07% F1 improvement over prompting-based methods and 13.37% over code-only fine-tuning, while maintaining comparable inference efficiency.
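The abstract does not spell out the loss, so the following is only a rough sketch of what code-comment contrastive alignment can look like, not the authors' formulation. It assumes a generic symmetric InfoNCE objective (code-to-comment and comment-to-code directions, one possible reading of "dual similarity") plus a hypothetical consistency term that penalizes differences between the similarity structure within each modality. All function names, the temperature, and the weight `lam` are illustrative assumptions, and the embeddings are plain lists of floats standing in for LLM outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(a, b):
    """Pairwise cosine similarity between every vector in a and every vector in b."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_info_nce(code_emb, comment_emb, temperature=0.07):
    """Assumed alignment loss: the i-th code and i-th comment form the positive pair."""
    sim = cosine_matrix(code_emb, comment_emb)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # comment-to-code direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def consistency_penalty(code_emb, comment_emb):
    """Hypothetical regularizer: mean squared gap between intra-modality similarities."""
    s_code = cosine_matrix(code_emb, code_emb)
    s_comm = cosine_matrix(comment_emb, comment_emb)
    n = len(code_emb)
    gap = sum((s_code[i][j] - s_comm[i][j]) ** 2 for i in range(n) for j in range(n))
    return gap / (n * n)


def total_loss(code_emb, comment_emb, temperature=0.07, lam=0.1):
    """Alignment loss plus the weighted consistency term (lam is an assumed weight)."""
    return (symmetric_info_nce(code_emb, comment_emb, temperature)
            + lam * consistency_penalty(code_emb, comment_emb))


if __name__ == "__main__":
    code = [[1.0, 0.0], [0.0, 1.0]]
    comments = [[0.9, 0.1], [0.1, 0.9]]  # roughly aligned with the code vectors
    print(total_loss(code, comments))
```

In an actual fine-tuning setup, a term of this kind would typically be added to the vulnerability classification loss and computed over batches of model embeddings; the sketch only illustrates that matched code-comment pairs yield a lower alignment loss than mismatched ones.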

Problem

Research questions and friction points this paper is trying to address.

software vulnerability detection

multimodal representation

code and comments

generalization

semantic information

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal representation

contrastive learning

code-comment alignment