🤖 AI Summary
This work addresses the limited representativeness of Java programs in the SV-COMP benchmark suite, which hampers the effective evaluation of verification tools. To bridge this gap, the authors apply the ARG-V tool for automatic benchmark generation, aligned with SV-COMP verification task specifications, to construct 68 structurally sound and semantically realistic Java benchmark programs. Experimental evaluation reveals a significant drop in both accuracy and recall for four state-of-the-art Java verifiers on this new benchmark, exposing their limitations in handling real-world software behaviors. These findings not only provide concrete evidence for guiding future tool improvements but also substantially enhance the coverage and realism of the SV-COMP Java benchmark suite for more reliable empirical assessment.
📝 Abstract
The SV-COMP competition provides a state-of-the-art platform for evaluating software verification tools on a standardized set of verification tasks. Consequently, verifier development outcomes are influenced by the composition of program benchmarks included in SV-COMP. When expanding this benchmark corpus, it is crucial to consider whether newly added programs cause verifiers to exhibit behavior distinct from that observed on existing benchmarks. Doing so helps mitigate external threats to the validity of the competition's results. In this paper, we present the application of the ARG-V tool for automatically generating Java verification benchmarks in the SV-COMP format. We demonstrate that, on a newly generated set of 68 realistic benchmarks, all four leading Java verifiers decrease in accuracy and recall compared to their performance on the existing benchmark suite. These findings highlight the potential of ARG-V to enhance the comprehensiveness and realism of verification tool evaluation, while also providing a roadmap for verifier developers aiming to improve their tools' applicability to real-world software.