🤖 AI Summary
This work investigates whether large language models (LLMs) can perform formal program analysis via abstract interpretation—a foundational technique in static analysis requiring precise semantic modeling and iterative convergence.
Method: We introduce abstract interpretation theory into LLM-based reasoning for the first time, proposing two structured prompting strategies: (i) compositional prompting to encode semantic rules, and (ii) fixed-point equation prompting to simulate iterative convergence. Our approach relies solely on prompt engineering—no fine-tuning is involved—and is evaluated on 22 verification tasks from the SV-COMP 2019 benchmark.
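For readers unfamiliar with the underlying technique: abstract interpretation over-approximates a program's reachable states by iterating abstract transfer functions until a fixed point is reached, typically with widening to force convergence. The following minimal sketch, in a simple interval domain, illustrates the kind of fixed-point computation the Fixed Point Equation strategy asks the LLM to emulate (illustrative only; the variable names and the example loop are not from the paper):

```python
# Interval-domain abstract interpretation of the loop:
#   i = 0; while (i < 10) i = i + 1;
# We solve the fixed-point equation I = init ⊔ transfer(I)
# by widening iteration, then refine with one descending pass.

INF = float("inf")

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def leq(a, b):
    """Interval inclusion: a ⊑ b."""
    return b[0] <= a[0] and a[1] <= b[1]

def widen(old, new):
    """Classic interval widening: jump unstable bounds to infinity."""
    lo = old[0] if new[0] >= old[0] else -INF
    hi = old[1] if new[1] <= old[1] else INF
    return (lo, hi)

def transfer(i):
    """Abstract effect of one iteration: assume i < 10, then i += 1."""
    lo, hi = i[0], min(i[1], 9)  # meet with the guard (integers: i <= 9)
    if lo > hi:
        return None              # guard unsatisfiable: body unreachable
    return (lo + 1, hi + 1)

def loop_invariant(init):
    """Iterate with widening until a post-fixpoint, then narrow once."""
    inv = init
    while True:
        body = transfer(inv)
        post = inv if body is None else join(init, body)
        if leq(post, inv):       # stable: inv over-approximates post
            break
        inv = widen(inv, post)
    body = transfer(inv)         # one descending pass to refine bounds
    return inv if body is None else join(init, body)

inv = loop_invariant((0, 0))                # i ∈ [0, 10] at the loop head
exit_iv = (max(inv[0], 10), inv[1])         # guard negation i >= 10: i ∈ [10, 10]
print(inv, exit_iv)
```

Widening is what makes termination non-obvious to reproduce in natural-language reasoning: the model must recognize when a bound is unstable and jump it to infinity rather than enumerating iterations, which is one place where the logical errors reported in the Results tend to surface.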
Contribution/Results: We establish the first prompt-based framework for abstract interpretation with LLMs, enabling symbolic, interpretable, and reproducible static analysis. Experiments show that models such as GPT-4 can approximate abstract interpretation behavior, yet reveal persistent logical errors and hallucinations—particularly in inferring loop invariants and handling complex control flow. The framework advances the frontier of LLM-driven formal methods by bridging high-level reasoning with rigorous semantics.
📝 Abstract
LLMs have demonstrated impressive capabilities in code generation and comprehension, but their potential to perform program analysis in a formal, automatic manner remains under-explored. To that end, we systematically investigate whether LLMs can reason about programs using a program analysis framework called abstract interpretation. We prompt LLMs to follow two different strategies, denoted Compositional and Fixed Point Equation, to reason formally in the style of abstract interpretation, which, to the best of our knowledge, has not been done before. We validate our approach using state-of-the-art LLMs on 22 challenging benchmark programs from the Software Verification Competition (SV-COMP) 2019 dataset, which is widely used in program analysis. Our results show that our strategies elicit abstract interpretation-based reasoning in the tested models, but that LLMs remain susceptible to logical errors, especially when interpreting complex program structures, as well as to general hallucinations. This highlights key areas for improvement in the formal reasoning capabilities of LLMs.