🤖 AI Summary
To address the high testing cost, redundant test cases, and low coverage that plague prompt template evaluation in LLM applications, this paper introduces Adaptive Random Testing (ART) to prompt engineering for the first time. We propose a dynamic test input selection method guided by output diversity and human feedback. The approach combines Levenshtein distance with BERT-based cosine similarity to build a multi-granularity output dissimilarity metric, and adds a dynamic test-suite scoring and resampling mechanism. Evaluated across multiple real-world LLM application scenarios, the method significantly reduces the testing budget required for failure detection: output diversity increases by 27%, and fault-detection efficiency improves by up to 42%. This work establishes a scalable, low-cost, automated testing paradigm for validating prompt robustness.
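The selection loop sketched above can be illustrated in a few lines of Python. This is a minimal sketch, not the paper's implementation: the equal 0.5/0.5 weighting is an assumption, and a token-level cosine dissimilarity stands in for the BERT-based cosine similarity (which would require an embedding model).

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]


def normalized_levenshtein(a: str, b: str) -> float:
    """Character-level dissimilarity in [0, 1]."""
    return levenshtein(a, b) / max(len(a), len(b), 1)


def token_cosine_dissimilarity(a: str, b: str) -> float:
    """Token-level stand-in for BERT-based cosine dissimilarity (assumption)."""
    va, vb = Counter(a.split()), Counter(b.split())
    num = sum(va[t] * vb[t] for t in va)
    den = math.sqrt(sum(v * v for v in va.values())) * \
          math.sqrt(sum(v * v for v in vb.values()))
    return 1.0 - (num / den if den else 0.0)


def dissimilarity(a: str, b: str, w: float = 0.5) -> float:
    """Multi-granularity output dissimilarity; weight w is illustrative."""
    return w * normalized_levenshtein(a, b) + (1 - w) * token_cosine_dissimilarity(a, b)


def select_next(candidates: list[str], executed_outputs: list[str]) -> str:
    """ART step: pick the candidate farthest (min-distance) from outputs seen so far."""
    return max(candidates,
               key=lambda c: min(dissimilarity(c, o) for o in executed_outputs))
```

For example, given an executed output `"the cat sat on the mat"`, `select_next` prefers a candidate sharing no tokens over a near-duplicate, which is the diversity-driving behaviour the method relies on.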
📝 Abstract
The recent surge in building software systems powered by Large Language Models (LLMs) has led to the development of various testing frameworks that primarily treat prompt templates as the unit of testing. Despite the significant costs of executing test inputs and assessing outputs, these tools still overlook the curation of optimized test suites, which calls for tailored test selection or prioritization strategies. In this paper, we show that diversity-based testing techniques, such as Adaptive Random Testing (ART) with appropriate string distance metrics, can be effectively applied to the testing of prompt templates. Our adaptive testing approach adjusts the conventional ART process to this context by selecting new test inputs based on scores derived from the existing test suite and its labelling results. Our results, obtained with implementations that explore several string-based distances, confirm that the approach discovers failures with reduced testing budgets and promotes the generation of more varied outputs.