ATLAS: Automated Tree-based Language Analysis System for C and C++ source programs

📅 2025-12-13
🤖 AI Summary
Static analysis of C/C++ programs is hard because of pointer aliasing, multi-level indirection, function pointers, and type ambiguity introduced by `typedef`. To address these challenges, the paper proposes an end-to-end, interprocedural, type-aware static analysis system. It constructs a unified multi-view intermediate representation that integrates AST, CFG, and DFG, and it introduces custom CFG/DFG construction algorithms alongside an alias-aware type inference module to support interprocedural, type-sensitive data-flow modeling. Contributions include: (1) statement-level control-flow and type-aware data-flow graph generation for code that need not compile; (2) empirical validation on real-world open-source projects, reported as modeling complex semantic dependencies with high fidelity; and (3) interpretable, structurally enriched input representations for downstream software engineering tasks such as vulnerability detection and code completion.

📝 Abstract
The growing complexity of modern software systems has highlighted the shortcomings of traditional program analysis techniques, particularly for Software Engineering (SE) tasks. While machine learning and Large Language Models (LLMs) offer promising solutions, their effectiveness is limited by the way they interpret data. Unlike natural language, the meaning of source code is defined less by token adjacency and more by complex, long-range structural relationships and dependencies. This limitation is especially pronounced for C and C++, where flatter syntactic hierarchies, pointer aliasing, multi-level indirection, typedef-based type obfuscation, and function-pointer calls hinder accurate static analysis. To address these challenges, this paper introduces ATLAS, a Python-based Command-Line Interface (CLI) that (i) generates statement-level Control Flow Graphs (CFG) and type-aware Data Flow Graphs (DFG) that capture inter-function dependencies across the entire program; (ii) works on entire C and C++ projects comprising multiple files; (iii) works on both compilable and non-compilable code; and (iv) produces a unified multi-view code representation using Abstract Syntax Trees (AST), CFG, and DFG. By preserving essential structural and semantic information, ATLAS provides a practical foundation for improving downstream SE and machine-learning-based program understanding.
Video demonstration: https://youtu.be/RACWQe5ELwY
Tool repository: https://github.com/jaid-monwar/ATLAS-code-representation-tool
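
To make the terms "statement-level CFG" and "DFG" concrete, here is a minimal sketch on a four-statement toy program. It is not ATLAS's implementation or output format: the statement table, the `STATEMENTS` and `CFG_EDGES` names, and the `data_flow_edges` helper are all hypothetical, and the data-flow edges are derived with a textbook reaching-definitions pass as a stand-in for whatever ATLAS does internally. Types, pointers, and calls across functions are ignored.

```python
from collections import defaultdict

# Hypothetical toy program, one entry per statement (not ATLAS output):
#   1: int x = 1;
#   2: if (c)
#   3:     x = 2;
#   4: int y = x;
STATEMENTS = {
    1: {"defs": {"x"}, "uses": set()},
    2: {"defs": set(), "uses": {"c"}},
    3: {"defs": {"x"}, "uses": set()},
    4: {"defs": {"y"}, "uses": {"x"}},
}
# Statement-level control-flow edges: statement 2 branches to 3 or skips to 4.
CFG_EDGES = [(1, 2), (2, 3), (2, 4), (3, 4)]


def data_flow_edges(statements, cfg_edges):
    """Return sorted (def_stmt, use_stmt, variable) triples."""
    preds = defaultdict(set)
    for src, dst in cfg_edges:
        preds[dst].add(src)

    # out_defs[s]: (variable, defining statement) pairs reaching the exit of s.
    out_defs = {s: set() for s in statements}

    def reaching_in(s):
        # Definitions reaching the entry of s: union over its CFG predecessors.
        return set().union(*(out_defs[p] for p in preds[s]))

    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for s, info in statements.items():
            # A definition in s kills earlier definitions of the same variable.
            new_out = {(v, d) for v, d in reaching_in(s) if v not in info["defs"]}
            new_out |= {(v, s) for v in info["defs"]}
            if new_out != out_defs[s]:
                out_defs[s] = new_out
                changed = True

    edges = set()
    for s, info in statements.items():
        edges |= {(d, s, v) for v, d in reaching_in(s) if v in info["uses"]}
    return sorted(edges)


DFG_EDGES = data_flow_edges(STATEMENTS, CFG_EDGES)
print(DFG_EDGES)
```

In this toy case both definitions of `x` reach statement 4, because the branch at statement 2 may or may not execute statement 3. The variable `c` has no defining statement in the snippet, so it produces no edge.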
Problem

Research questions and friction points this paper is trying to address.

Generating statement-level control flow and type-aware data flow graphs for C/C++ programs
Handling multi-file C/C++ projects, whether or not the code compiles
Producing a unified multi-view code representation for program analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates statement-level CFGs and type-aware DFGs that capture dependencies across functions
Works on multi-file C/C++ projects, with compilable or non-compilable code
Produces a unified code representation that combines AST, CFG, and DFG
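
One way to picture a unified multi-view representation is a single node set shared by several labeled edge sets. The sketch below is an assumption made for illustration only: the `MultiViewGraph` class, its methods, and the `"AST"`/`"CFG"`/`"DFG"` labels are hypothetical and do not describe ATLAS's actual data model or export format.

```python
from dataclasses import dataclass, field


@dataclass
class MultiViewGraph:
    """Hypothetical container: one node set, edges tagged by view."""
    nodes: dict = field(default_factory=dict)  # node id -> label
    edges: list = field(default_factory=list)  # (src, dst, view) triples

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, src, dst, view):
        # Refuse dangling edges so every view refers to the same node set.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.append((src, dst, view))

    def view(self, name):
        """Return the (src, dst) pairs belonging to a single view."""
        return [(s, d) for s, d, v in self.edges if v == name]


graph = MultiViewGraph()
graph.add_node("f", "function_definition: f")  # parent node in the AST view
graph.add_node(1, "int x = 1;")
graph.add_node(2, "int y = x;")
graph.add_edge("f", 1, "AST")  # syntactic containment
graph.add_edge("f", 2, "AST")
graph.add_edge(1, 2, "CFG")    # statement 2 executes after statement 1
graph.add_edge(1, 2, "DFG")    # x defined in statement 1 is used in statement 2

print(graph.view("CFG"))
print(graph.view("DFG"))
```

Keeping the views as labels on one graph lets a downstream consumer select a single view or traverse several at once without re-aligning node identifiers.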
Jaid Monwar Chowdhury
Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Ahmad Farhan Shahriar Chowdhury
Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Humayra Binte Monwar
Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Mahmuda Naznin
Professor, Computer Science and Engineering, Bangladesh University of Engineering and Technology
Machine Learning, Image and Speech, Network Virtualization, Graph, WSN