🤖 AI Summary
This work addresses emerging security risks introduced by integrating large language models (LLMs) into GitHub CI workflows, where adversarial inputs can manipulate LLM prompts and outputs to induce erroneous security decisions or trigger privileged operations. The study presents the first systematic investigation of this threat landscape, proposing an end-to-end threat modeling methodology and an LLM-CI risk taxonomy of high-level risk classes and concrete threat vectors. To automate threat detection, the authors design Heimdallr, a hybrid analysis framework that combines static program analysis with LLM-based semantic understanding. Heimdallr normalizes each workflow into an LLM-Workflow Property Graph (L-WPG), then applies triggerability analysis, LLM-assisted dataflow summarization, and deterministic propagation to identify exploitable threat vectors. Evaluated on 300 manually annotated unique workflows, the approach achieves an F1 score of 0.994 for LLM-node identification, 99.8% accuracy in triggerability classification, and a micro-averaged F1 of 0.917 for threat-vector detection. The authors have so far disclosed 802 vulnerable workflow instances across 759 repositories and received 71 acknowledgments from affected projects.
📝 Abstract
GitHub Continuous Integration (CI) workflows increasingly integrate Large Language Models (LLMs) to automate review, triage, content generation, and repository maintenance. This creates a new attack surface: externally controllable workflow inputs can shape LLM prompts and outputs, which may in turn affect security decisions, repository state, or privileged execution. Although LLM security and CI security have each been studied extensively, their intersection remains underexplored. In this paper, we present the first study of LLM-induced security risks in GitHub CI workflows. We characterize the problem along the full execution chain and develop a taxonomy of high-level risk classes and concrete threat vectors. To detect such risks in practice, we design Heimdallr, a hybrid analysis framework that normalizes workflows into an LLM-Workflow Property Graph (L-WPG) and combines triggerability analysis, LLM-assisted dataflow summarization, and deterministic propagation to synthesize concrete threat-vector findings. Evaluated on 300 manually annotated unique workflows, Heimdallr achieves high accuracy on LLM-node identification (F1 = 0.994), triggerability classification (99.8%), and threat-vector detection (micro-average F1 = 0.917). As part of an ongoing detection and disclosure effort, we have so far responsibly disclosed 802 vulnerable workflow instances across 759 repositories and received 71 acknowledgments.
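To make the idea of deterministic propagation over a workflow graph concrete, the following is a minimal illustrative sketch, not Heimdallr's implementation. It assumes a toy graph whose nodes are tagged as externally controllable sources, LLM calls, ordinary steps, or privileged sinks, and it reports every path that carries attacker-controlled input through an LLM node into a sink. The `Node` schema, the node kinds, the `propagate_taint` function, and the example workflow are all hypothetical names introduced here for illustration; the abstract does not specify the L-WPG schema or the propagation rules.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """One step in a toy workflow graph (hypothetical schema, not the paper's L-WPG)."""
    name: str
    kind: str  # one of "source", "llm", "step", "sink"
    successors: list[str] = field(default_factory=list)


def propagate_taint(nodes: dict[str, Node], externally_triggerable: bool) -> list[list[str]]:
    """Return every source-to-sink path that passes through an LLM node.

    If the workflow cannot be triggered by an external actor, no path is
    reported, mirroring the role a triggerability check could play.
    """
    if not externally_triggerable:
        return []

    findings: list[list[str]] = []
    # Start one path at each externally controllable input.
    queue = deque([name] for name, node in nodes.items() if node.kind == "source")

    while queue:
        path = queue.popleft()
        current = nodes[path[-1]]
        if current.kind == "sink":
            # Only report sinks reached via at least one LLM node.
            if any(nodes[step].kind == "llm" for step in path):
                findings.append(path)
            continue
        for nxt in current.successors:
            if nxt not in path:  # skip cycles
                queue.append(path + [nxt])
    return findings


# Hypothetical workflow: an issue body flows into an LLM whose output is executed,
# while a PR title flows directly into a labeling step without touching an LLM.
workflow = {
    "issue_body": Node("issue_body", "source", ["build_prompt"]),
    "build_prompt": Node("build_prompt", "step", ["llm_review"]),
    "llm_review": Node("llm_review", "llm", ["run_shell"]),
    "run_shell": Node("run_shell", "sink"),
    "pr_title": Node("pr_title", "source", ["add_label"]),
    "add_label": Node("add_label", "sink"),
}

findings = propagate_taint(workflow, externally_triggerable=True)
for path in findings:
    print(" -> ".join(path))
```

In this toy model only the path through `llm_review` is reported; the `pr_title` path reaches a sink without involving an LLM and is ignored. A real analysis would presumably also need the per-step dataflow summaries the abstract mentions, which this sketch replaces with hand-written edges.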