🤖 AI Summary
Static analysis of C/C++ programs is difficult because of pointer aliasing, multi-level indirection, function pointers, and type ambiguity induced by `typedef`. To address these challenges, this paper proposes an end-to-end, compiler-agnostic, interprocedural, type-aware static analysis system. Methodologically, it leverages Clang LibTooling to construct a unified multi-view intermediate representation (integrating AST, CFG, and DFG) and introduces custom CFG/DFG construction algorithms alongside an alias-aware type inference module, enabling interprocedural, type-sensitive data-flow modeling. Contributions include: (1) statement-level control-flow and type-aware data-flow graph generation for uncompiled code; (2) empirical validation on real-world open-source projects demonstrating high-fidelity modeling of complex semantic dependencies; and (3) interpretable, structurally enriched input representations for downstream software engineering tasks such as vulnerability detection and code completion.
📝 Abstract
The growing complexity of modern software systems has highlighted the shortcomings of traditional program analysis techniques, particularly for Software Engineering (SE) tasks. While machine learning and Large Language Models (LLMs) offer promising solutions, their effectiveness is limited by the way they interpret data. Unlike natural language, the meaning of source code is defined less by token adjacency and more by complex, long-range structural relationships and dependencies. This limitation is especially pronounced for C and C++, where flatter syntactic hierarchies, pointer aliasing, multi-level indirection, typedef-based type obfuscation, and function-pointer calls hinder accurate static analysis. To address these challenges, this paper introduces ATLAS, a Python-based Command-Line Interface (CLI) that (i) generates statement-level Control Flow Graphs (CFG) and type-aware Data Flow Graphs (DFG) that capture inter-function dependencies across the entire program; (ii) handles entire C and C++ projects comprising multiple files; (iii) works on both compilable and non-compilable code; and (iv) produces a unified multi-view code representation that combines Abstract Syntax Trees (AST), CFG, and DFG. By preserving essential structural and semantic information, ATLAS provides a practical foundation for improving downstream SE tasks and machine-learning-based program understanding.
Video demonstration: https://youtu.be/RACWQe5ELwY
Tool repository: https://github.com/jaid-monwar/ATLAS-code-representation-tool
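To make the idea of a unified multi-view representation concrete, the following is a minimal, hand-written sketch in Python. It is not ATLAS's API or output format: the `MultiViewGraph` class, the `resolve_type` helper, and the edge labels are all hypothetical names chosen for illustration, and the edges for the small C fragment in the comments are written by hand rather than computed by any analysis. It only shows how AST, CFG, and DFG edges over the same statement nodes can live in one graph, with typedef aliases resolved so that data-flow edges carry the underlying type.

```python
from dataclasses import dataclass, field

# Hypothetical C fragment being modeled (edges below are hand-written):
#   typedef int *IntPtr;
#   1: int x = 0;
#   2: IntPtr p = &x;
#   3: *p = 5;
#   4: return x;

VIEWS = {"AST", "CFG", "DFG"}


@dataclass
class MultiViewGraph:
    """Statement nodes shared by several edge sets, one per view."""
    nodes: dict = field(default_factory=dict)   # node id -> statement text
    edges: list = field(default_factory=list)   # (src, dst, view, label)

    def add_node(self, node_id, text):
        self.nodes[node_id] = text

    def add_edge(self, src, dst, view, label=""):
        if view not in VIEWS:
            raise ValueError(f"unknown view: {view}")
        self.edges.append((src, dst, view, label))

    def view(self, name):
        """Return the (src, dst, label) edges belonging to one view."""
        return [(s, d, lbl) for s, d, v, lbl in self.edges if v == name]


def resolve_type(name, typedefs):
    """Follow typedef aliases until a non-alias type (or a cycle) is reached."""
    seen = set()
    while name in typedefs and name not in seen:
        seen.add(name)
        name = typedefs[name]
    return name


def build_example():
    typedefs = {"IntPtr": "int *"}
    g = MultiViewGraph()
    g.add_node(0, "function body")
    for node_id, text in [(1, "int x = 0;"), (2, "IntPtr p = &x;"),
                          (3, "*p = 5;"), (4, "return x;")]:
        g.add_node(node_id, text)
        g.add_edge(0, node_id, "AST", "child")      # syntactic containment
    for src, dst in [(1, 2), (2, 3), (3, 4)]:
        g.add_edge(src, dst, "CFG", "next")         # straight-line control flow
    # Data-flow edges labeled with the resolved (typedef-free) type.
    g.add_edge(1, 2, "DFG", "x: int")
    g.add_edge(2, 3, "DFG", f"p: {resolve_type('IntPtr', typedefs)}")
    g.add_edge(3, 4, "DFG", "x: int (written through *p)")
    return g


if __name__ == "__main__":
    graph = build_example()
    for view_name in sorted(VIEWS):
        print(view_name, graph.view(view_name))
```

Keeping every view on the same node identifiers is the design choice that matters here: a downstream model or query can follow a control-flow edge and a data-flow edge from the same statement without any cross-graph mapping.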