🤖 AI Summary
Existing LLM-driven NLP software lacks dedicated robustness testing methodologies. Method: We propose AORTA, the first automated robustness testing framework tailored for LLMs, which formulates testing as a combinatorial optimization problem, enables transfer of DNN testing techniques, and introduces Adaptive Beam Search (ABS)—a novel algorithm that dynamically adjusts beam width and backtracking strategies to enhance coverage and naturalness in ultra-high-dimensional feature spaces. AORTA integrates 18 test generation methods, multiple threat models, query-efficiency optimizations, and naturalness evaluation. Results: Evaluated across three datasets and five threat models, AORTA achieves an average test success rate of 86.14%. Compared to PWWS, it reduces per-success test time by 3441.9 seconds and query count by 218.8×, while generating more natural and cross-model transferable adversarial examples.
📝 Abstract
Benefiting from advancements in LLMs, NLP software has undergone rapid development. Such software is widely employed in safety-critical tasks such as financial sentiment analysis, toxic content moderation, and log generation. To our knowledge, no automated robustness testing method has been designed specifically for LLM-based NLP software. Given the complexity of LLMs and the unpredictability of real-world inputs (including both prompts and examples), it is essential to examine the robustness of the overall input to ensure the safety of such software. To this end, this paper introduces the first AutOmated Robustness Testing frAmework, AORTA, which reformulates the testing process as a combinatorial optimization problem. AORTA allows existing testing methods designed for DNN-based software to be applied to LLM-based software, but their effectiveness in this setting is limited. To address this, we propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search (ABS). ABS is tailored to the expansive feature space of LLMs and improves testing effectiveness through an adaptive beam width and the ability to backtrack. We embed 18 testing methods in AORTA and compare the testing effectiveness of ABS across three datasets and five threat models. ABS enables a more comprehensive and accurate robustness assessment before software deployment, achieving an average test success rate of 86.138%. Compared to PWWS, the best-performing baseline, ABS reduces computational overhead by up to 3441.895 seconds per successful test case and decreases the number of queries by 218.762 times on average. Furthermore, test cases generated by ABS exhibit greater naturalness and transferability.
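The abstract describes ABS only at a high level. As a minimal illustrative sketch of the two ideas it names, an adaptive beam width and backtracking over a combinatorial space of word-level substitutions, the search loop might look like the following. This is not the paper's actual algorithm: the function names, the widen/narrow heuristic, and the `score` interface (higher score = closer to flipping the model under test) are all assumptions for illustration.

```python
def adaptive_beam_search(tokens, substitutes, score,
                         max_width=8, min_width=2, budget=200):
    """Illustrative adaptive beam search over word-level substitutions.

    tokens:      list of words in the original input
    substitutes: dict mapping a token position to candidate replacements
    score:       function(list_of_words) -> float; higher means closer to
                 a misprediction by the threat model (one query per call)
    Returns the best (score, perturbed_tokens) found within the budget.
    """
    beam = [(score(tokens), tokens)]   # (score, candidate) pairs
    queries = 1
    width = min_width
    best = beam[0]
    history = []                       # stack of earlier beams for backtracking

    for pos in sorted(substitutes):
        candidates = []
        for _, cand in beam:
            for sub in substitutes[pos]:
                if queries >= budget:
                    break
                new = cand[:pos] + [sub] + cand[pos + 1:]
                candidates.append((score(new), new))
                queries += 1
        candidates.extend(beam)        # keep the "no change here" option
        candidates.sort(key=lambda p: p[0], reverse=True)

        if candidates[0][0] > best[0]:
            best = candidates[0]
            width = max(min_width, width - 1)   # progress: narrow the beam
        else:
            width = min(max_width, width * 2)   # stagnation: widen the beam
            if history:                          # and backtrack one step
                beam = history.pop()
                continue

        history.append(beam)
        beam = candidates[:width]
    return best
```

A toy run with a scoring function that just counts "successful" substitutions shows the mechanics: `adaptive_beam_search(["good", "movie"], {0: ["great", "fine"], 1: ["film"]}, lambda t: sum(w in {"great", "film"} for w in t))` walks positions left to right and returns the fully substituted candidate. In the real setting, each `score` call is a query to the LLM under test, which is why the abstract reports query counts alongside success rate.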