An Empirical Study of Large Language Models for Type and Call Graph Analysis

📅 2024-10-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are increasingly applied to program analysis tasks such as type inference and call-graph construction, yet their capabilities across languages and tasks remain poorly understood for lack of rigorous cross-language benchmarks. Method: The authors systematically evaluate 24 LLMs (including GPT-4o and mistral-large-it-2407-123b) on type inference and call-graph analysis for Python and JavaScript, using two newly proposed benchmark suites, SWARM-CG and SWARM-JS (the first to support cross-language call-graph evaluation), together with an expanded TypeEvalPy containing 77,268 automatically generated type annotations. Contribution/Results: Experiments show that LLMs significantly outperform HeaderGen and HiTyper on Python type inference, but still lag behind static analyzers such as PyCG in call-graph generation. In JavaScript, the static tool TAJS underperforms because it cannot handle modern language features, while LLMs exhibit critical gaps in completeness and soundness when constructing call graphs. The work provides the first empirical characterization of this performance divergence across two fundamental program analysis tasks, establishing benchmarks, methodology, and insights for LLM-driven program analysis.

📝 Abstract
Large Language Models (LLMs) are increasingly being explored for their potential in software engineering, particularly in static analysis tasks. In this study, we investigate the potential of current LLMs to enhance call-graph analysis and type inference for Python and JavaScript programs. We empirically evaluated 24 LLMs, including OpenAI's GPT series and open-source models like LLaMA and Mistral, using existing and newly developed benchmarks. Specifically, we enhanced TypeEvalPy, a micro-benchmarking framework for type inference in Python, with auto-generation capabilities, expanding its scope from 860 to 77,268 type annotations for Python. Additionally, we introduced SWARM-CG and SWARM-JS, comprehensive benchmarking suites for evaluating call-graph construction tools across multiple programming languages. Our findings reveal a contrasting performance of LLMs in static analysis tasks. For call-graph generation in Python, traditional static analysis tools like PyCG significantly outperform LLMs. In JavaScript, the static tool TAJS underperforms due to its inability to handle modern language features, while LLMs, despite showing potential with models like mistral-large-it-2407-123b and GPT-4o, struggle with completeness and soundness in both languages for call-graph analysis. Conversely, LLMs demonstrate a clear advantage in type inference for Python, surpassing traditional tools like HeaderGen and hybrid approaches such as HiTyper. These results suggest that while LLMs hold promise in type inference, their limitations in call-graph analysis highlight the need for further research. Our study provides a foundation for integrating LLMs into static analysis workflows, offering insights into their strengths and current limitations.
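To make the type-inference setting concrete, here is a hypothetical micro-benchmark sample in the spirit of TypeEvalPy. The sample format, label keys, and `score` function are illustrative assumptions, not the benchmark's actual schema or metric: the idea is simply a short program with gold type labels and an exact-match score over them.

```python
# Hypothetical micro-benchmark sample: a short program plus gold
# (scope, variable) -> type labels, as a type-inference tool or LLM
# would be asked to predict them.
sample_code = '''
def double(x):
    y = x * 2
    return y

result = double(21)
'''

expected = {
    ("double", "x"): "int",
    ("double", "y"): "int",
    ("<module>", "result"): "int",
}

def score(predicted: dict, gold: dict) -> float:
    """Exact-match accuracy over the labeled slots."""
    hits = sum(predicted.get(k) == v for k, v in gold.items())
    return hits / len(gold)

# A mock model prediction that gets one slot wrong:
predicted = {
    ("double", "x"): "int",
    ("double", "y"): "int",
    ("<module>", "result"): "float",
}
print(score(predicted, expected))  # 2 of 3 slots correct
```

Auto-generating thousands of such labeled samples (as the expanded TypeEvalPy does) allows evaluation at a scale that hand annotation cannot reach.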
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for call-graph analysis in Python and JavaScript
Assessing LLMs' performance in type inference for Python programs
Comparing LLMs with traditional static analysis tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced TypeEvalPy with auto-generation for Python
Introduced SWARM-CG and SWARM-JS benchmarking suites
Evaluated 24 LLMs for call-graph and type analysis
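As an illustration of what call-graph construction involves, the sketch below extracts only direct, name-based call edges from a toy program; the function and program are illustrative, not taken from the paper's benchmarks. Static tools like PyCG must additionally resolve attributes, higher-order functions, and aliasing, which is where both naive analyses and LLMs tend to lose soundness and completeness.

```python
import ast

# Toy program whose ground-truth call graph is
# {"greet": ["print"], "main": ["greet"]}.
SOURCE = """
def greet(name):
    print("Hello, " + name)

def main():
    greet("world")
"""

def naive_call_graph(source: str) -> dict[str, list[str]]:
    """Collect direct, name-based call edges per top-level function.

    Only simple `Name` calls are resolved; attribute calls, closures,
    and aliased functions are ignored, unlike in a real analyzer.
    """
    tree = ast.parse(source)
    graph: dict[str, list[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            callees = [
                sub.func.id
                for sub in ast.walk(node)
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name)
            ]
            graph[node.name] = callees
    return graph

print(naive_call_graph(SOURCE))  # {'greet': ['print'], 'main': ['greet']}
```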