Automated Red-Teaming Framework for Large Language Model Security Assessment: A Comprehensive Attack Generation and Detection System

📅 2025-12-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the challenges of safety alignment for large language models (LLMs) in high-stakes applications and the poor scalability of manual red-teaming, this paper introduces the first meta-prompt-driven automated red-teaming framework. The framework integrates multimodal anomaly detection with structured threat modeling to enable closed-loop evaluation across six standardized threat categories—including Reward Hacking and Deceptive Alignment. It pioneers a meta-prompt-guided adversarial prompt synthesis mechanism and a novel multimodal vulnerability co-detection paradigm, uncovering 12 previously undocumented attack patterns. Evaluated on GPT-OSS-20B, the framework identifies 47 vulnerabilities—including 21 high-severity ones—with a 3.9× higher detection rate than human experts and 89% accuracy. This significantly improves reproducibility, coverage, and interpretability in AI safety testing.

Technology Category

Application Category

📝 Abstract
As large language models (LLMs) are increasingly deployed in high-stakes domains, ensuring their security and alignment has become a critical challenge. Existing red-teaming practices depend heavily on manual testing, which limits scalability and fails to comprehensively cover the vast space of potential adversarial behaviors. This paper introduces an automated red-teaming framework that systematically generates, executes, and evaluates adversarial prompts to uncover security vulnerabilities in LLMs. Our framework integrates meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols spanning six major threat categories -- reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Experiments on the GPT-OSS-20B model reveal 47 distinct vulnerabilities, including 21 high-severity and 12 novel attack patterns, achieving a $3.9 imes$ improvement in vulnerability discovery rate over manual expert testing while maintaining 89% detection accuracy. These results demonstrate the framework's effectiveness in enabling scalable, systematic, and reproducible AI safety evaluations. By providing actionable insights for improving alignment robustness, this work advances the state of automated LLM red-teaming and contributes to the broader goal of building secure and trustworthy AI systems.
Problem

Research questions and friction points this paper is trying to address.

Automates adversarial prompt generation to uncover LLM security vulnerabilities
Systematically tests six major threat categories for comprehensive vulnerability assessment
Improves vulnerability discovery rate over manual testing while maintaining high accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated framework generates adversarial prompts systematically
Integrates meta-prompting attack synthesis and multi-modal detection
Standardized evaluation covers six major threat categories
🔎 Similar Papers
No similar papers found.
Zhang Wei
Zhang Wei
Professor of Physical Chemistry, Shaanxi Normal University
Artificial PhotosynthesisWater Splitting
P
Peilu Hu
Independent Researcher
S
Shengning Lang
Stevens Institute of Technology, Hoboken, NJ, USA
H
Hao Yan
Stevens Institute of Technology, Hoboken, NJ, USA
L
Li Mei
Independent Researcher
Y
Yichao Zhang
The University of Texas at Dallas, Richardson, TX, USA
C
Chen Yang
AI Safety Research Lab, Institute of Advanced Computing, Shenzhen, China
Junfeng Hao
Junfeng Hao
广东医科大学附属医院 血液透析中心 主任医师
肾病 血液透析 血透通路
Z
Zhimo Han
Zheng Zhou University of Light Industry, Zhengzhou, China