🤖 AI Summary
Taint analysis faces two key challenges: heavy reliance on manual annotation of third-party libraries, and poor scalability caused by the size of whole-program dependency graphs. This paper proposes a language-agnostic method for constructing explicit data-dependency graphs that enables efficient taint analysis of large-scale programs. Its contributions are threefold: (1) the first taint propagation mechanism supporting *incremental library annotation*, i.e., adding or refining library function specifications *without re-analyzing the entire program*; (2) a lightweight, cross-language intermediate representation for data dependencies; and (3) a flow-sensitive but context-insensitive over-approximation model built atop Joern, balancing precision and performance. Evaluation demonstrates substantial improvements in analysis speed and scalability, making the approach suitable for CI/CD integration. The implementation is open-sourced and integrated into Joern, advancing practical, automated vulnerability detection.
📝 Abstract
Taint analysis using explicit whole-program data-dependence graphs is powerful for vulnerability discovery but faces two major challenges. First, accurately modeling taint propagation through calls to external library procedures requires extensive manual annotations, which becomes impractical for large ecosystems. Second, the sheer size of whole-program graph representations leads to serious scalability and performance issues, particularly when quick analysis is needed in continuous development pipelines. This paper presents the design and implementation of a system for a language-agnostic data-dependence representation. The system accommodates missing annotations describing the behavior of library procedures by over-approximating data flows, allowing annotations to be added later without recalculation. We contribute this data-flow analysis system to the open-source code analysis platform Joern, making it available to the community.
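The idea of over-approximating unannotated library calls and refining them later can be illustrated with a small sketch. This is a toy model written for this page, not Joern's implementation or API: the class `DataDependenceGraph`, the `RET` marker, the annotation format (a set of allowed `(source slot, destination slot)` pairs per callee), and the library function `copy_buf` are all hypothetical. The sketch assumes that every argument of an unannotated call may flow to the return value and to every other argument, and that annotations are consulted only while a query traverses the graph, so adding one never requires rebuilding the graph.

```python
from collections import defaultdict, deque

RET = -1  # hypothetical slot index standing for a call's return value


class DataDependenceGraph:
    """Toy explicit data-dependence graph (illustrative, not Joern's API)."""

    def __init__(self):
        # node -> list of (successor, label); label is None for ordinary
        # flows, or (callee, src_slot, dst_slot) for flows through a call.
        self.edges = defaultdict(list)

    def add_flow(self, src, dst):
        """Record an ordinary data dependence, e.g. an assignment."""
        self.edges[src].append((dst, None))

    def add_library_call(self, callee, args, ret):
        """Over-approximate an external call: each argument may flow to the
        return value and to every other argument."""
        slots = list(enumerate(args)) + [(RET, ret)]
        for i, src in enumerate(args):
            for j, dst in slots:
                if i != j:
                    self.edges[src].append((dst, (callee, i, j)))

    def reachable(self, source, sink, annotations):
        """Breadth-first taint query. Annotations prune call edges at query
        time; callees without an annotation keep all over-approximated edges."""
        seen, queue = {source}, deque([source])
        while queue:
            node = queue.popleft()
            if node == sink:
                return True
            for nxt, label in self.edges.get(node, ()):
                if label is not None:
                    callee, i, j = label
                    allowed = annotations.get(callee)
                    if allowed is not None and (i, j) not in allowed:
                        continue  # annotation rules this flow out
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False


# Usage: status = copy_buf(dst_buf, src_buf), with dst_buf tainted.
g = DataDependenceGraph()
g.add_flow("user_input", "dst_buf")
g.add_library_call("copy_buf", args=["dst_buf", "src_buf"], ret="status")
g.add_flow("status", "sink")

# No annotation yet: the over-approximation reports a flow to the sink.
unannotated = g.reachable("user_input", "sink", {})

# Annotation added later: only argument 1 flows into argument 0.
# The graph itself is unchanged; only the query consults the new annotation.
annotated = g.reachable("user_input", "sink", {"copy_buf": {(1, 0)}})
```

In this sketch the unannotated query reports a flow, because nothing is known about `copy_buf`, while the annotated query does not. The trade-off is that the stored graph holds more edges than a fully annotated one would, in exchange for never having to recompute it when an annotation arrives.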