🤖 AI Summary
This work addresses the difficulty of applying automated software analysis tools to open-source projects, which typically requires complex environment setup, dependency resolution, and tool configuration. To evaluate such tasks systematically, the authors introduce AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten C/C++ and Java projects, and present it as the first systematic study of LLM-based agents on this task. They evaluate four agent architectures, including their custom AnalysisAgent, across four large language model (LLM) backends, and compare manually verified results against LLM-based self-validation. The results indicate that agent architecture matters more than underlying model capability alone: AnalysisAgent achieves a manually verified success rate of 94% (33 of 35 tasks) with Gemini-3-Flash, compared to 77% for the best baseline, ExecutionAgent. The study also finds that Java toolchains are harder than C/C++, that whole-program analyses and symbolic execution are the most difficult tasks, and that LLM self-validation consistently overstates manually verified success.
📝 Abstract
Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, yet no prior work has systematically studied their effectiveness on the specific task of automated software analysis. Unlike issue solving or general environment setup, this task requires installing and configuring a separate analysis tool alongside the target project, generating tool-specific prerequisites, and validating that the tool produces meaningful analysis outputs. We introduce AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten diverse C/C++ and Java projects, each with a manually constructed reference setup. Using AnalysisBench, we evaluate four agent architectures across four LLM backends. Our custom agent, AnalysisAgent, achieves a manually verified success rate of 94% (33/35 tasks) with Gemini-3-Flash, compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.