🤖 AI Summary
This paper addresses the dual challenges of regulatory compliance and conversational naturalness faced by task-oriented LLM agents in highly policy-constrained domains (e.g., customer service). To tackle this, we first introduce a novel threat model centered on the underlying motivations for policy violations. Second, we propose CRAFT—a multi-agent red-teaming framework that integrates policy-aware persuasive strategies with adversarial dialogue generation to systematically probe and expose manipulative behaviors. Third, we release tau-break, the first benchmark specifically designed to evaluate manipulative jailbreaking attacks. Experiments demonstrate that CRAFT significantly outperforms conventional jailbreaking methods—including DAN and emotion-based manipulation—while existing defense mechanisms exhibit insufficient robustness against such attacks. These findings underscore the critical need for deep policy alignment and proactive, semantics-aware safeguarding mechanisms in production-grade LLM agents.
📝 Abstract
Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this threat, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercive tactics. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.
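To make the multi-agent red-teaming idea concrete, the sketch below shows the general shape of such a loop: an attacker cycles through policy-aware persuasive strategies, a target agent responds, and a judge checks whether the response concedes a policy-violating action. This is a minimal toy illustration, not the paper's implementation: the agents are rule-based stand-ins for LLM calls, and names like `attacker_turn`, `target_reply`, and `judge_violation` are hypothetical.

```python
# Toy sketch of a CRAFT-style red-teaming loop. In a real system, each of
# the three roles (attacker, target, judge) would be an LLM call; here they
# are deterministic stubs so the control flow is visible.

POLICY = "Refunds are only allowed within 30 days of purchase."

# Policy-aware persuasive strategies the attacker tries in turn (illustrative).
STRATEGIES = [
    "appeal_to_exception",   # "surely the policy has exceptions for loyal customers"
    "authority_claim",       # "a supervisor already approved this"
    "emotional_pressure",    # "this hardship case deserves special treatment"
]

def attacker_turn(strategy: str, goal: str) -> str:
    """Attacker agent: craft a user message applying one persuasive strategy."""
    return f"[{strategy}] Please grant: {goal}"

def target_reply(message: str) -> str:
    """Target agent: a naive policy-adherent agent that caves to authority claims."""
    if "authority_claim" in message:
        return "APPROVED: processing your refund."
    return f"I'm sorry, I can't do that. Policy: {POLICY}"

def judge_violation(reply: str) -> bool:
    """Judge agent: flag replies that concede the disallowed action."""
    return reply.startswith("APPROVED")

def red_team(goal: str):
    """Try strategies until one elicits a policy violation; return (strategy, reply)."""
    for strategy in STRATEGIES:
        reply = target_reply(attacker_turn(strategy, goal))
        if judge_violation(reply):
            return strategy, reply
    return None, None
```

Running `red_team("a refund for a 90-day-old order")` on this toy target returns the first strategy that breaks it (`"authority_claim"`); a real harness would log every dialogue and score attack success rates across goals.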