Can Language Models Pass Software Testing Certification Exams? A Case Study

📅 2026-03-24
🤖 AI Summary
This study investigates whether large language models possess the conceptual understanding and knowledge-application capabilities required to pass professional software testing certifications. We introduce the first benchmark for evaluating language models on software testing proficiency by systematically assessing 60 commercial and open-source multimodal large language models on ISTQB certification exams, comprising 1,171 questions spanning foundational to expert levels. To probe deeper conceptual understanding, we employ context-preserving metamorphic transformations of the exam questions. Results show that two commercial models achieve scores of at least 65% on all 30 certification exams, with commercial models substantially outperforming their open-source counterparts. Beyond establishing this novel evaluation framework, our systematic error analysis yields actionable insights for both certification exam design and model improvement.

📝 Abstract
Large Language Models (LLMs) play a pivotal role in both academic research and broader societal applications. LLMs are increasingly used in software testing activities such as test case generation, selection, and repair. However, several important questions remain: (1) do LLMs possess enough knowledge of software testing principles to perform software testing tasks effectively? (2) do LLMs possess sufficient conceptual understanding of software testing to answer software testing questions under metamorphic transformations? and (3) do certain properties of software testing questions influence the performance of LLMs? To answer these questions, this study evaluates 60 multimodal language models from both commercial vendors and the open-source community. The evaluation uses 30 sample exams of different types (core foundation, core advanced, specialist, and expert) from the International Software Testing Qualifications Board (ISTQB), which are used to assess the competence of human testers. In total, each model is evaluated on 1,171 questions. Furthermore, to assess whether the models possess genuine conceptual understanding rather than memorized answers, they are also tested on exam questions transformed using context-preserving metamorphic techniques. Two models passed all the certifications by scoring at least 65% on each of the 30 certification exams, with commercial models generally outperforming open-source ones. We analyze the reasons behind incorrect answers and provide recommendations for improving the design of software testing certification exams.
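To illustrate the kind of evaluation described above, here is a minimal sketch of one common context-preserving metamorphic transformation for multiple-choice exam questions: permuting the answer options. The paper does not specify its exact transformations, so the function name and question data below are illustrative assumptions; the key idea is that the question's meaning is unchanged, so a model with genuine conceptual understanding should still select the same underlying answer.

```python
import random

def shuffle_options(question, options, answer_key, seed=0):
    """Context-preserving metamorphic transformation (illustrative):
    permute the option texts of a multiple-choice question while
    tracking where the correct option moves. A model's answer should
    be invariant (up to relabeling) under this transformation."""
    rng = random.Random(seed)
    keys = sorted(options)                 # e.g. ["A", "B", "C", "D"]
    texts = [options[k] for k in keys]
    rng.shuffle(texts)                     # in-place permutation
    new_options = dict(zip(keys, texts))
    # Locate the label now attached to the originally correct text.
    correct_text = options[answer_key]
    new_answer_key = next(k for k, t in new_options.items()
                          if t == correct_text)
    return question, new_options, new_answer_key

# Hypothetical ISTQB-style question, not taken from the paper's dataset.
q, opts, key = shuffle_options(
    "Which test level focuses on interactions between components?",
    {"A": "Unit testing", "B": "Integration testing",
     "C": "Acceptance testing", "D": "Static analysis"},
    answer_key="B",
    seed=42,
)
```

The invariant being tested is that `opts[key]` still names the correct option after shuffling; comparing a model's answers on the original and transformed variants separates conceptual understanding from memorization of option positions.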
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Software Testing
Certification Exams
Conceptual Understanding
Metamorphic Transformations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Software Testing Certification
Metamorphic Testing
ISTQB
Conceptual Understanding