RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic coverage criteria in safety testing for large language models (LLMs), which hinders the assessment of how thoroughly test cases cover risks related to harmful content generation. To this end, the authors propose RACA, a framework that introduces representation-aware coverage criteria into LLM safety evaluation. RACA leverages representation engineering to identify safety-critical concepts, and it integrates expert-calibrated safety representations with concept activation scoring to define six coverage criteria spanning both individual and composite concepts. Experimental results demonstrate that RACA effectively identifies high-quality jailbreak prompts and outperforms existing methods in test suite prioritization and adversarial prompt sampling, while exhibiting strong generalization and robustness across diverse settings.
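The concept activation scoring described above can be sketched in a minimal form. The paper does not give implementation details here, so the following is an assumption-laden illustration: it uses a difference-of-means "reading vector" (a common representation-engineering recipe) built from calibration activations, and scores a prompt by projecting its hidden state onto that direction. The function names and the difference-of-means choice are hypothetical, not taken from the paper.

```python
import numpy as np

def safety_direction(jailbreak_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Hypothetical safety-concept direction: mean hidden activation of
    calibration jailbreak prompts minus the mean for benign prompts,
    normalized to unit length."""
    d = jailbreak_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def activation_score(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    """Scalar conceptual activation score: projection of a test prompt's
    hidden state onto the safety-concept direction."""
    return float(hidden_state @ direction)

# Toy usage with 2-D stand-in activations (real hidden states would come
# from an LLM layer and have thousands of dimensions).
jb = np.array([[1.0, 0.0], [1.0, 0.0]])
bn = np.array([[-1.0, 0.0], [-1.0, 0.0]])
direction = safety_direction(jb, bn)
score = activation_score(np.array([3.0, 5.0]), direction)
```

A higher score would indicate stronger expression of the safety-critical concept in the prompt's representation, which is the quantity the coverage criteria are then computed over.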

📝 Abstract
Recent advancements in LLMs have led to significant breakthroughs in various AI applications. However, their sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria to evaluate the quality and adequacy of these tests. While coverage criteria have been effective for smaller neural networks, they are not directly applicable to LLMs due to scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria specifically designed for LLM safety testing. RACA leverages representation engineering to focus on safety-critical concepts within LLMs, thereby reducing dimensionality and filtering out irrelevant information. The framework operates in three stages: first, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts. Second, it calculates conceptual activation scores for a given test suite based on these representations. Finally, it computes coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization; the results demonstrate that RACA successfully identifies high-quality jailbreak prompts and outperforms traditional neuron-level criteria. We also showcase its practical application in real-world scenarios, such as test set prioritization and attack prompt sampling. Furthermore, our findings confirm RACA's generalization to diverse scenarios and its robustness across configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs, contributing a valuable technique to the field of AI testing.
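The third stage, computing coverage from conceptual activation scores, can be illustrated with a small sketch. The paper defines six sub-criteria over individual and compositional concepts but does not spell them out here, so the two functions below are hypothetical stand-ins: a single-concept criterion in the spirit of k-multisection coverage (fraction of bins of the calibrated score range hit by the suite), and a composite criterion measuring the fraction of concept pairs jointly activated by some test. The names, the binning scheme, and the pairwise definition are assumptions for illustration.

```python
import numpy as np

def section_coverage(scores, lo: float, hi: float, k: int = 10) -> float:
    """Hypothetical single-concept criterion: split the calibrated activation
    range [lo, hi] into k equal bins and return the fraction of bins that
    at least one test's score falls into."""
    hit = set()
    for s in scores:
        if lo <= s <= hi:
            # Bin index of s; clamp the upper edge into the last bin.
            hit.add(min(int((s - lo) / (hi - lo) * k), k - 1))
    return len(hit) / k

def pairwise_coverage(score_matrix: np.ndarray, thresholds: np.ndarray) -> float:
    """Hypothetical compositional criterion: fraction of concept pairs that
    some single test activates jointly, i.e. both concepts' scores exceed
    their thresholds on the same test prompt."""
    active = score_matrix > thresholds  # shape (n_tests, n_concepts)
    n = active.shape[1]
    covered = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if np.any(active[:, i] & active[:, j])
    )
    total = n * (n - 1) // 2
    return covered / total if total else 0.0
```

Under this reading, a test suite that only probes a narrow band of activation scores, or that never co-activates related safety concepts, would score low on coverage even if it contains many prompts, which matches the paper's motivation for adequacy criteria beyond static datasets.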
Problem

Research questions and friction points this paper is trying to address.

LLM safety testing
coverage criteria
jailbreak attacks
representation engineering
test adequacy
Innovation

Methods, ideas, or system contributions that make the work stand out.

representation engineering
coverage criteria
LLM safety testing
jailbreak detection
conceptual activation