🤖 AI Summary
To address the high testing cost, redundant test cases, and low coverage that plague prompt template evaluation in LLM applications, this paper introduces Adaptive Random Testing (ART) to prompt engineering for the first time. We propose a dynamic test input selection method guided by output diversity and human feedback. The approach combines Levenshtein distance with BERT-based cosine similarity to build a multi-granularity output dissimilarity metric, and adds a dynamic test-suite scoring and resampling mechanism. Evaluated across multiple real-world LLM application scenarios, the method significantly reduces the testing budget required for failure detection: output diversity increases by 27%, and fault-detection efficiency improves by up to 42%. This work establishes a scalable, low-cost, automated testing paradigm for validating prompt robustness.
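The selection loop sketched above can be illustrated in a few lines of Python. This is a minimal sketch, not the paper's implementation: the equal 0.5/0.5 weighting is an assumption, and a token-level cosine dissimilarity stands in for the BERT-based cosine similarity (which would require an embedding model).

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]


def normalized_levenshtein(a: str, b: str) -> float:
    """Character-level dissimilarity in [0, 1]."""
    return levenshtein(a, b) / max(len(a), len(b), 1)


def token_cosine_dissimilarity(a: str, b: str) -> float:
    """Token-level stand-in for BERT-based cosine dissimilarity (assumption)."""
    va, vb = Counter(a.split()), Counter(b.split())
    num = sum(va[t] * vb[t] for t in va)
    den = math.sqrt(sum(v * v for v in va.values())) * \
          math.sqrt(sum(v * v for v in vb.values()))
    return 1.0 - (num / den if den else 0.0)


def dissimilarity(a: str, b: str, w: float = 0.5) -> float:
    """Multi-granularity output dissimilarity; weight w is illustrative."""
    return w * normalized_levenshtein(a, b) + (1 - w) * token_cosine_dissimilarity(a, b)


def select_next(candidates: list[str], executed_outputs: list[str]) -> str:
    """ART step: pick the candidate farthest (min-distance) from outputs seen so far."""
    return max(candidates,
               key=lambda c: min(dissimilarity(c, o) for o in executed_outputs))
```

For example, given an executed output `"the cat sat on the mat"`, `select_next` prefers a candidate sharing no tokens over a near-duplicate, which is the diversity-driving behaviour the method relies on.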
📝 Abstract
The recent surge in building software systems powered by Large Language Models (LLMs) has led to the development of various testing frameworks that primarily treat prompt templates as the unit of testing. Despite the significant costs of executing test inputs and assessing outputs, these tools still overlook the curation of optimized test suites, which calls for tailored test selection or prioritization strategies. In this paper, we show that diversity-based testing techniques, such as Adaptive Random Testing (ART) with appropriate string distance metrics, can be effectively applied to the testing of prompt templates. Our adaptive testing approach adjusts the conventional ART process to this context by selecting new test inputs based on scores derived from the existing test suite and its labelling results. Our results, obtained with implementations that explore several string-based distances, confirm that the approach discovers failures with reduced testing budgets and promotes the generation of more varied outputs.