🤖 AI Summary
This work addresses the weak robustness of LLM-based NLP software, where minor input perturbations can cause erroneous outputs, and two limitations of existing testing methods: low effectiveness and poor naturalness of generated test cases. We propose ABFS, presented as the first automated robustness testing method that treats prompts and in-context examples as a single input. ABFS formulates testing as a combinatorial optimization problem, searches the perturbation space with Best-First Search, and applies an adaptive control strategy to keep test cases natural; it is evaluated on three datasets across five threat models. On Llama2-13b, ABFS reaches a 98.064% success rate versus 13.273% for StressTest (over 7× higher), while making fewer modifications to the original input and producing more natural, more transferable test cases at higher testing efficiency. Our core contributions are (1) a unified robustness testing approach that covers prompt and examples together, and (2) an adaptive best-first search procedure for finding test cases.
📝 Abstract
Owing to the exceptional performance of Large Language Models (LLMs) in Natural Language Processing (NLP) tasks, LLM-based NLP software has rapidly gained traction across various domains, such as financial analysis and content moderation. However, these applications frequently exhibit robustness deficiencies, where slight perturbations in input (prompt+example) may lead to erroneous outputs. Current robustness testing methods face two main limitations: (1) low testing effectiveness, limiting the applicability of LLM-based software in safety-critical scenarios, and (2) insufficient naturalness of test cases, reducing the practical value of testing outcomes. To address these issues, this paper proposes ABFS, a straightforward yet effective automated testing method that, for the first time, treats the input prompts and examples as a unified whole for robustness testing. Specifically, ABFS formulates the testing process as a combinatorial optimization problem, employing Best-First Search to identify successful test cases within the perturbation space and designing a novel Adaptive control strategy to enhance test case naturalness. We evaluate the robustness testing performance of ABFS on three datasets across five threat models. On Llama2-13b, the traditional StressTest achieves only a 13.273% success rate, while ABFS attains a success rate of 98.064%, supporting a more comprehensive robustness assessment before software deployment. Compared to baseline methods, ABFS introduces fewer modifications to the original input and consistently generates test cases with superior naturalness. Furthermore, test cases generated by ABFS exhibit stronger transferability and higher testing efficiency, significantly reducing testing costs.
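To make the search formulation concrete, the following is a minimal sketch of best-first search over a word-substitution perturbation space. It is not the ABFS implementation: the function `best_first_search`, the scorer `toy_confidence`, the substitution table, and the `max_edits` cap are all hypothetical. In particular, `max_edits` is only a crude stand-in for naturalness control and does not reproduce the paper's adaptive strategy, and the toy scorer replaces what would be a query to the LLM under test.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple


def best_first_search(
    tokens: List[str],
    candidates: Dict[int, List[str]],
    confidence: Callable[[List[str]], float],
    threshold: float = 0.5,
    max_edits: int = 3,
    max_queries: int = 200,
) -> Optional[List[str]]:
    """Search for a perturbed input whose confidence drops below `threshold`.

    `candidates` maps a token position to its allowed substitutions.
    `confidence` returns the model's confidence in the original (correct) label.
    Returns the first successful test case found, or None if none is found
    within the edit and query budgets.
    """
    start = tuple(tokens)
    start_score = confidence(list(start))
    if start_score < threshold:
        return list(start)  # the unperturbed input already fails

    # Min-heap ordered by confidence: the most promising state is expanded first.
    frontier: List[Tuple[float, int, Tuple[str, ...]]] = [(start_score, 0, start)]
    seen = {start}
    queries = 1

    while frontier:
        _, edits, current = heapq.heappop(frontier)
        if edits >= max_edits:
            continue  # crude cap on modifications (assumed, not the paper's strategy)
        for pos, subs in candidates.items():
            if current[pos] != start[pos]:
                continue  # each position is modified at most once
            for sub in subs:
                neighbour = current[:pos] + (sub,) + current[pos + 1:]
                if neighbour in seen:
                    continue
                if queries >= max_queries:
                    return None  # query budget exhausted
                seen.add(neighbour)
                score = confidence(list(neighbour))
                queries += 1
                if score < threshold:
                    return list(neighbour)
                heapq.heappush(frontier, (score, edits + 1, neighbour))
    return None


def toy_confidence(tokens: List[str]) -> float:
    """Hypothetical stand-in for an LLM: confidence that the text is positive."""
    weights = {"great": 0.3, "love": 0.3, "good": 0.1, "fine": 0.05}
    return min(1.0, 0.2 + sum(weights.get(t, 0.0) for t in tokens))


original = ["i", "love", "this", "great", "movie"]
subs = {1: ["like", "enjoy"], 3: ["good", "fine"]}

result = best_first_search(original, subs, toy_confidence)
limited = best_first_search(original, subs, toy_confidence, max_edits=1)
print(result, limited)
```

In this toy setup no single substitution pushes the confidence below 0.5, so the search has to combine two edits; with `max_edits=1` it gives up and returns `None`. Ordering the frontier by confidence is what makes the search "best-first": states that have already weakened the model's prediction the most are expanded before the others.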