Scalable Language Agnostic Taint Tracking using Explicit Data Dependencies

📅 2025-06-06
🤖 AI Summary
Taint analysis faces two key challenges: heavy reliance on manual annotation of third-party libraries, and poor scalability caused by the prohibitively large size of whole-program dependence graphs. This paper proposes a language-agnostic method for constructing explicit data-dependence graphs that enables efficient taint analysis of large programs. The summary identifies three contributions: (1) a taint propagation mechanism, described as the first to support incremental library annotation, that is, adding or refining library function specifications without re-analyzing the entire program; (2) a lightweight, cross-language intermediate representation for data dependencies; and (3) a flow-sensitive but context-insensitive over-approximation model built atop Joern that balances precision and performance. The evaluation is reported to show substantial improvements in analysis speed and scalability, making the approach suitable for CI/CD integration. The implementation is open source and integrated into Joern, supporting practical, automated vulnerability detection.

📝 Abstract
Taint analysis using explicit whole-program data-dependence graphs is powerful for vulnerability discovery but faces two major challenges. First, accurately modeling taint propagation through calls to external library procedures requires extensive manual annotations, which becomes impractical for large ecosystems. Second, the sheer size of whole-program graph representations leads to serious scalability and performance issues, particularly when quick analysis is needed in continuous development pipelines. This paper presents the design and implementation of a system for a language-agnostic data-dependence representation. The system accommodates missing annotations describing the behavior of library procedures by over-approximating data flows, allowing annotations to be added later without recalculation. We contribute this data-flow analysis system to the open-source code analysis platform Joern, making it available to the community.
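
The idea of over-approximating flows through unannotated library calls, then refining them later without rebuilding the graph, can be illustrated with a small sketch. This is not Joern's implementation or API. It assumes a toy graph in which every argument of an external call is conservatively connected to the call's return value, and in which annotations are consulted only at query time. The class `DependenceGraph`, the function `reachable`, the annotation format, and the library names `lib.escape` and `lib.concat` are all hypothetical.

```python
from collections import defaultdict, deque


class DependenceGraph:
    """Toy data-dependence graph (hypothetical, not Joern's data model)."""

    def __init__(self):
        # node -> set of (successor, label); label is None for ordinary flows
        # and (callee, argument_index) for flows through an external call.
        self.edges = defaultdict(set)

    def add_flow(self, src, dst):
        self.edges[src].add((dst, None))

    def add_external_call(self, callee, args, ret):
        # Over-approximation: no annotation is consulted while building the
        # graph, so every argument is assumed to flow to the return value.
        for index, arg in enumerate(args):
            self.edges[arg].add((ret, (callee, index)))


def reachable(graph, source, sink, annotations):
    """Breadth-first search that skips call edges ruled out by an annotation.

    annotations maps a callee name to the set of argument indices that
    propagate taint to the return value. Callees without an entry keep the
    conservative default (all arguments propagate).
    """
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for succ, label in graph.edges.get(node, ()):
            if label is not None:
                callee, index = label
                allowed = annotations.get(callee)
                if allowed is not None and index not in allowed:
                    continue  # the annotation says this argument does not flow
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False


graph = DependenceGraph()
graph.add_flow("user_input", "a")
graph.add_external_call("lib.escape", ["a"], "n")       # n = lib.escape(a)
graph.add_external_call("lib.concat", ["a", "b"], "s")  # s = lib.concat(a, b)
graph.add_flow("n", "query_n")
graph.add_flow("s", "query_s")

annotations = {}
before = reachable(graph, "user_input", "query_n", annotations)

# Add an annotation later: assume lib.escape propagates no taint.
annotations["lib.escape"] = set()
after = reachable(graph, "user_input", "query_n", annotations)
still = reachable(graph, "user_input", "query_s", annotations)
```

In this sketch the graph is built once and never modified. Adding the `lib.escape` entry changes only which edges the query is willing to traverse, so the flow to `query_n` disappears while the still-unannotated `lib.concat` keeps its conservative flow to `query_s`.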
Problem

Research questions and friction points this paper is trying to address.

Modeling taint propagation without extensive manual annotations
Addressing scalability issues in whole-program graph representations
Providing language-agnostic data-dependence analysis for vulnerability discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-agnostic data-dependence representation system
Over-approximating data flows for missing annotations
Integration into open-source platform Joern
👥 Authors
Sedick David Baker Effendi
Stellenbosch University, Stellenbosch, South Africa
Xavier Pinho
StackGen, San Ramon, USA
Andrei Michael Dreyer
Whirly Labs, Cape Town, South Africa
Fabian Yamaguchi
Whirly Labs (Pty) Ltd, Qwiet.AI, Stellenbosch University