Scalable Language Agnostic Taint Tracking using Explicit Data Dependencies

📅 2025-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Taint analysis faces two key challenges: heavy reliance on manual annotation of third-party libraries, and poor scalability caused by the size of whole-program dependency graphs. This paper proposes a language-agnostic method for constructing explicit data-dependency graphs that enables efficient taint analysis of large programs. Its contributions are threefold: (1) the first taint propagation mechanism supporting *incremental library annotation*, i.e., adding or refining library function specifications *without re-analyzing the entire program*; (2) a lightweight, cross-language intermediate representation for data dependencies; and (3) a flow-sensitive but context-insensitive over-approximation model built atop Joern, balancing precision and performance. Evaluation demonstrates substantial improvements in analysis speed and scalability, making the approach suitable for CI/CD integration. The implementation is open-sourced and integrated into Joern, advancing practical, automated vulnerability detection.

📝 Abstract

Taint analysis using explicit whole-program data-dependence graphs is powerful for vulnerability discovery but faces two major challenges. First, accurately modeling taint propagation through calls to external library procedures requires extensive manual annotations, which becomes impractical for large ecosystems. Second, the sheer size of whole-program graph representations leads to serious scalability and performance issues, particularly when quick analysis is needed in continuous development pipelines. This paper presents the design and implementation of a system for a language-agnostic data-dependence representation. The system accommodates missing annotations describing the behavior of library procedures by over-approximating data flows, allowing annotations to be added later without recalculation. We contribute this data-flow analysis system to the open-source code analysis platform Joern, making it available to the community.
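The idea of over-approximating flows through unannotated library calls, then refining with annotations without recomputing the graph, can be illustrated with a minimal sketch. This is not Joern's API or the paper's algorithm: the function names (`build_edges`, `reachable`), the call records, and the annotation format are all hypothetical, chosen only to show the principle. The sketch assumes that an unannotated call may propagate data from every argument to its result, and that an annotation lists which argument positions actually do.

```python
from collections import deque


def build_edges(calls):
    """Build over-approximated dependence edges once.

    Assumption: with no knowledge of a library procedure, every
    argument may flow to the call's result.
    """
    edges = []
    for callee, args, result in calls:
        for index, arg in enumerate(args):
            edges.append((arg, result, callee, index))
    return edges


def reachable(edges, source, annotations):
    """Return the variables reachable from `source`.

    Annotations are consulted during traversal, so adding one
    never requires rebuilding `edges`.
    """
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for src, dst, callee, index in edges:
            if src != node or dst in seen:
                continue
            allowed = annotations.get(callee)  # None: callee is unannotated
            if allowed is not None and index not in allowed:
                continue  # annotation rules this flow out
            seen.add(dst)
            queue.append(dst)
    return seen


# Hypothetical call records: (callee, argument variables, result variable).
calls = [
    ("lib_transform", ["user_input", "key"], "encoded"),
    ("lib_length", ["encoded"], "size"),
    ("lib_concat", ["prefix", "size"], "query"),
]
edges = build_edges(calls)

# No annotations yet: taint from user_input conservatively reaches query.
before = reachable(edges, "user_input", {})

# Hypothetical annotation added later: lib_length propagates no argument.
annotations = {"lib_length": set()}
after = reachable(edges, "user_input", annotations)
```

In this sketch the edge list is computed once and reused for both queries; only the annotation table changes between them. The first query is sound but imprecise, and the second removes the spurious flow into `query` once the behavior of `lib_length` is specified.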

Problem

Research questions and friction points this paper is trying to address.

Modeling taint propagation without extensive manual annotations

Addressing scalability issues in whole-program graph representations

Providing language-agnostic data-dependence analysis for vulnerability discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-agnostic data-dependence representation system

Over-approximating data flows for missing annotations

Integration into open-source platform Joern

🔎 Similar Papers

A Comprehensive Survey of Contamination Detection Methods in Large Language Models