π€ AI Summary
To address the lack of automation, comprehensiveness, and timeliness in large language model (LLM) security testing, this paper proposes the first security-category-oriented black-box coverage criterion and an end-to-end automated security testing framework. The framework integrates retrieval-augmented generation (RAG), few-shot prompting, and real-time web browsing to dynamically generate diverse, balanced, and up-to-date unsafe prompts. It innovatively adopts the LLM-as-oracle paradigm for safety evaluation, with empirical results demonstrating that GPT-3.5 outperforms both GPT-4 and LlamaGuard in safety detection tasks. Experiments show that, compared to static datasets, the framework achieves nearly a twofold increase in unsafe behavior detection rate under equivalent test scale; significantly enhances identification of emerging risksβsuch as novel harmful topics; and has been systematically validated across mainstream LLMs for comprehensive security assessment.
π Abstract
Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. However, ensuring their safety is paramount as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e., different style and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses, and even surpassing more recent LLMs (e.g., GPT-4), as well as LLMs that are specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) the results confirm that our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM on generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors.