A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks

📅 2024-03-15

🏛️ ACM Computing Surveys

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

The selection of source code representations for machine learning-driven cybersecurity tasks lacks systematic, evidence-based guidance. Method: The authors survey existing ML-based approaches and map the code representations they use (including tokens, abstract syntax trees (ASTs), control-flow graphs (CFGs), and program dependence graphs (PDGs)) against models spanning SVMs, RNNs, GNNs, and Transformers, across vulnerability detection, malware classification, and other security tasks in C, Python, and Java. Contribution/Results: The survey organizes the literature along four dimensions (representation, task, language, model) and reports paper counts showing that graph-based representations are the most popular category, with tokenizers and ASTs the two most popular individual representations; that C and vulnerability detection dominate current research; and that sequence-based models are the most popular model category while SVMs are the most popular individual model. The analysis highlights which representations are paired with which tasks, languages, and models, and points out gaps in the existing literature.

📝 Abstract

Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key part of any such technique, because it affects how the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to review the current state of the field to better understand what exists and what is still missing. This paper presents a study of existing ML-based approaches and shows which types of representations were used for different cybersecurity tasks and programming languages. Additionally, we study which types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and that Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall (i.e., ASTs and Tokenizers are the representations with the highest count of papers, while graph-based representations are the category with the highest count of papers). We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.
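To make the two most popular representations concrete, here is a minimal sketch of how one small function can be turned into a token sequence and into a set of AST edges. It is an illustration only and is not taken from the surveyed paper: it uses Python's standard `tokenize` and `ast` modules on a Python snippet (the survey reports C as the most covered language), and the names `SNIPPET`, `token_representation`, and `ast_edges` are hypothetical.

```python
import ast
import io
import tokenize

# Hypothetical toy input; real pipelines would read functions from a corpus.
SNIPPET = "def add(a, b):\n    return a + b\n"

# Token types that carry layout only and are dropped from the sequence.
LAYOUT_TOKENS = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_representation(source):
    """Return the source as a flat sequence of token strings."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in LAYOUT_TOKENS
    ]


def ast_edges(source):
    """Return (parent, child) edges between AST node type names."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


if __name__ == "__main__":
    print(token_representation(SNIPPET))
    print(ast_edges(SNIPPET))
```

The token list is the kind of input a sequence-based model would consume, while the edge list is a starting point for tree- or graph-based models; a real pipeline would additionally map tokens or node types to integer ids or embeddings.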

Problem

Research questions and friction points this paper is trying to address.

Surveying source code representations for ML-based cybersecurity tasks

Analyzing representation impact on model learning and feature extraction

Identifying popular representations, models, tasks, and languages in current research

Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based representations are the most popular representation category

Tokenizers and ASTs are top individual code representations

Sequence-based models are the most popular model category; SVMs are the most popular individual model

🔎 Similar Papers

COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning