🤖 AI Summary
This work addresses two key limitations of security smell detection for Infrastructure-as-Code (IaC) scripts: low detection accuracy and insufficient modeling of long-range contextual dependencies. We propose the first cross-modal static analysis method that jointly leverages natural language and code semantics. Our approach integrates CodeBERT, fine-tuned for code–text semantic alignment, with LongFormer, which captures long-sequence dependencies, enabling fine-grained identification of security misconfigurations in Ansible and Puppet scripts. Evaluated on real-world datasets, our method achieves precision/recall of 0.92/0.88 on Ansible (+46/+9 percentage points) and 0.87/0.75 on Puppet (+32/−22 percentage points), a substantial precision gain over state-of-the-art tools at the cost of recall on Puppet. Ablation studies and comparisons with large language models confirm the effectiveness and generalization advantage of our dual-encoder architecture.
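To make the dual-encoder idea concrete, the sketch below shows one plausible way to combine a CodeBERT encoder and a Longformer encoder behind a shared classification head using Hugging Face Transformers. The checkpoints, pooling strategy, and fusion layer are illustrative assumptions, not the exact architecture evaluated in the paper.

```python
# Illustrative sketch only: checkpoints, pooling, and the fusion head are
# assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DualEncoderSmellClassifier(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        # CodeBERT aligns code and natural-language semantics.
        self.code_encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        # Longformer keeps long scripts in context (up to 4096 tokens).
        self.long_encoder = AutoModel.from_pretrained("allenai/longformer-base-4096")
        hidden = (self.code_encoder.config.hidden_size
                  + self.long_encoder.config.hidden_size)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, code_inputs, long_inputs):
        # Take each encoder's first-token representation and concatenate them.
        code_vec = self.code_encoder(**code_inputs).last_hidden_state[:, 0]
        long_vec = self.long_encoder(**long_inputs).last_hidden_state[:, 0]
        return self.classifier(torch.cat([code_vec, long_vec], dim=-1))

# Example usage on a toy Ansible snippet.
code_tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
long_tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = DualEncoderSmellClassifier()

script = "- name: install nginx\n  ansible.builtin.apt:\n    name: nginx\n    state: present"
code_in = code_tok(script, return_tensors="pt", truncation=True)
long_in = long_tok(script, return_tensors="pt", truncation=True)
logits = model(code_in, long_in)   # shape: (1, num_labels)
```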
📝 Abstract
Infrastructure as Code (IaC) automates the provisioning and management of IT infrastructure through scripts and tools, streamlining software deployment. Prior studies have shown that IaC scripts often contain recurring security misconfigurations, and several detection and mitigation approaches have been proposed. Most of these rely on static analysis, using statistical code representations or Machine Learning (ML) classifiers to distinguish insecure configurations from safe code.
In this work, we introduce a novel approach that enhances static analysis with semantic understanding by jointly leveraging natural language and code representations. Our method builds on two complementary ML models: CodeBERT, to capture semantics across code and text, and LongFormer, to represent long IaC scripts without losing contextual information. We evaluate our approach on misconfiguration datasets from two widely used IaC tools, Ansible and Puppet. To validate its effectiveness, we conduct two ablation studies (removing code text from the natural-language input and truncating scripts to reduce context) and compare against four large language models (LLMs) and prior work. Results show that semantic enrichment substantially improves detection: on Ansible, precision and recall rise from 0.46 and 0.79 to 0.92 and 0.88; on Puppet, precision rises from 0.55 to 0.87 while recall decreases from 0.97 to 0.75.
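As an illustration of the kind of code/natural-language pairing such a cross-modal detector relies on, the sketch below splits a toy Ansible task into a code view (the raw script) and a natural-language view (comments and task names). Both the extraction rules and the example task are assumptions for illustration, not the preprocessing pipeline described in the paper.

```python
# Illustrative assumption: a simple way to derive a natural-language view
# (comments and task names) alongside the raw code view of an Ansible script.
ansible_task = """\
# Provision the admin account
- name: Create admin user with default credentials
  ansible.builtin.user:
    name: admin
    password: "changeme"   # hard-coded secret, a classic security smell
"""

def natural_language_view(script: str) -> str:
    """Collect full-line comments and task `name:` values as the text-side input."""
    parts = []
    for line in script.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            parts.append(stripped.lstrip("# ").strip())
        elif stripped.startswith("- name:"):
            parts.append(stripped.split("name:", 1)[1].split("#", 1)[0].strip())
    return " ".join(parts)

code_view = ansible_task                          # full raw script (code view)
text_view = natural_language_view(ansible_task)   # comments and task names (text view)
print(text_view)
# Provision the admin account Create admin user with default credentials
```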