ASTRAL: Automated Safety Testing of Large Language Models

📅 2025-01-28

📈 Citations: 0

✨ Influential: 0

career value

187K/year

🤖 AI Summary

To address the lack of automation, comprehensiveness, and timeliness in large language model (LLM) security testing, this paper proposes the first security-category-oriented black-box coverage criterion and an end-to-end automated security testing framework. The framework integrates retrieval-augmented generation (RAG), few-shot prompting, and real-time web browsing to dynamically generate diverse, balanced, and up-to-date unsafe prompts. It innovatively adopts the LLM-as-oracle paradigm for safety evaluation, with empirical results demonstrating that GPT-3.5 outperforms both GPT-4 and LlamaGuard in safety detection tasks. Experiments show that, compared to static datasets, the framework achieves nearly a twofold increase in unsafe behavior detection rate under equivalent test scale; significantly enhances identification of emerging risks—such as novel harmful topics; and has been systematically validated across mainstream LLMs for comprehensive security assessment.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. However, ensuring their safety is paramount as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e., different style and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses, and even surpassing more recent LLMs (e.g., GPT-4), as well as LLMs that are specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) the results confirm that our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM on generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors.

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Automated Safety Checking

Harm Mitigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

ASTRAL

Automated Security Testing

Large Language Models

🔎 Similar Papers

S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models