🤖 AI Summary
This work addresses the limitations of existing red-teaming approaches for large language models, which often rely on static heuristics or random search and struggle to circumvent advanced alignment mechanisms. The authors formulate jailbreaking as an adversarial partially observable Markov decision process (POMDP) and introduce Metis, a closed-loop reasoning framework built on metacognition. Its self-evolving loop diagnoses the target model’s defense logic through causal analysis and uses structured feedback as a semantic gradient to optimize attack trajectories in a directed, interpretable manner. Evaluated across ten mainstream models, the method achieves an average attack success rate of 89.2%, including 76.0% on O1 and 78.0% on GPT-5-chat, both frontier models, while reducing token costs by 8.2× on average and up to 11.4× compared to baseline approaches.
📝 Abstract
Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2× and up to 11.4×. Our analysis reveals that current defenses remain vulnerable to internally steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.
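The two headline metrics, average ASR and the token-cost reduction factor, can be made concrete with a small sketch. It assumes that ASR is the percentage of attempts a judge marks as successful, that the reported average is an unweighted mean over per-model ASRs, and that the reduction factor is baseline tokens divided by method tokens; the abstract does not spell out any of these definitions. All names (`attack_success_rate`, `token_reduction`, `model_a`, `model_b`) and all numbers are hypothetical toy values, not results from the paper.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Return the percentage of attempts judged successful.

    `outcomes` is a list of booleans, one per attempt. How an attempt is
    judged is outside this sketch (assumption: an external judge decides).
    """
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


def token_reduction(baseline_tokens, method_tokens):
    """Return how many times fewer tokens the method uses than a baseline."""
    if method_tokens <= 0:
        raise ValueError("method_tokens must be positive")
    return baseline_tokens / method_tokens


# Toy judge verdicts per target model (hypothetical, for illustration only).
toy_outcomes = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

# Per-model ASR, then an unweighted (macro) average across models.
per_model_asr = {name: attack_success_rate(o) for name, o in toy_outcomes.items()}
average_asr = mean(list(per_model_asr.values()))

# Toy (baseline_tokens, method_tokens) pairs per model (hypothetical).
toy_tokens = {"model_a": (8000, 1000), "model_b": (6000, 1500)}
reductions = [token_reduction(b, m) for b, m in toy_tokens.values()]

print(per_model_asr)
print(f"average ASR: {average_asr:.1f}%")
print(f"token reduction: mean {mean(reductions):.1f}x, max {max(reductions):.1f}x")
```

Reporting both a mean and a maximum reduction mirrors how the abstract states its efficiency result (an average factor and an "up to" factor); a macro average weights every target model equally regardless of how many attempts each received.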