ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

📅 2025-11-07

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Problem: In multi-agent systems, autonomous agents that interact with external services face a fundamental tension between utility and the protection of user privacy and security. Method: This paper introduces ConVerse, a dynamic benchmark that unifies the evaluation of privacy leakage and security vulnerabilities within multi-turn agent-to-agent dialogue across the travel, real estate, and insurance domains. It models 12 user personas and 864 contextually grounded attacks (611 privacy, 253 security) embedded in autonomously generated multi-turn conversations. Privacy is evaluated with a three-tier taxonomy that assesses abstraction quality, while security attacks target tool use and preference manipulation; outcomes are assessed via semantic analysis and behavioral tracing. Contribution/Results: Experiments on seven state-of-the-art models show that privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, and that stronger models leak more, exposing a “capability-risk paradox” in multi-agent interaction. ConVerse thus moves safety evaluation from static, single-agent settings toward dynamic, interactive ones.

📝 Abstract

As language models evolve into autonomous agents that act and communicate on behalf of users, ensuring safety in multi-agent ecosystems becomes a central challenge. Interactions between personal assistants and external service providers expose a core tension between utility and protection: effective collaboration requires information sharing, yet every exchange creates new attack surfaces. We introduce ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent-agent interactions. ConVerse spans three practical domains (travel, real estate, insurance) with 12 user personas and over 864 contextually grounded attacks (611 privacy, 253 security). Unlike prior single-agent settings, it models autonomous, multi-turn agent-to-agent conversations where malicious requests are embedded within plausible discourse. Privacy is tested through a three-tier taxonomy assessing abstraction quality, while security attacks target tool use and preference manipulation. Evaluating seven state-of-the-art models reveals persistent vulnerabilities; privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, with stronger models leaking more. By unifying privacy and security within interactive multi-agent contexts, ConVerse reframes safety as an emergent property of communication.
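The headline numbers in the abstract are per-category attack success rates. As a minimal sketch of how such a rate could be tallied, the Python below uses a hypothetical `AttackOutcome` record and `attack_success_rate` helper; neither name comes from ConVerse, and the toy outcomes are placeholders rather than results from the paper. Only the attack counts (611 privacy, 253 security) are taken from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackOutcome:
    """One attack attempt in a simulated conversation (hypothetical schema)."""
    domain: str      # e.g. "travel", "real estate", "insurance"
    category: str    # "privacy" or "security"
    succeeded: bool  # True if the assistant leaked data or took the unsafe action


def attack_success_rate(outcomes):
    """Return {category: fraction of attacks in that category that succeeded}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for outcome in outcomes:
        totals[outcome.category] += 1
        successes[outcome.category] += int(outcome.succeeded)
    return {category: successes[category] / totals[category] for category in totals}


# Attack counts as stated in the abstract; they sum to the benchmark total.
ATTACK_COUNTS = {"privacy": 611, "security": 253}
TOTAL_ATTACKS = sum(ATTACK_COUNTS.values())

# Toy placeholder outcomes for illustration only, NOT results from the paper.
toy_outcomes = [
    AttackOutcome("travel", "privacy", True),
    AttackOutcome("travel", "privacy", False),
    AttackOutcome("insurance", "security", False),
    AttackOutcome("real estate", "security", False),
]
rates = attack_success_rate(toy_outcomes)
```

Reporting the rate separately per category mirrors how the abstract quotes one figure for privacy attacks and another for security breaches; a per-model or per-domain breakdown would follow the same pattern with a different grouping key.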

Problem

Research questions and friction points this paper is trying to address.

Evaluating privacy and security risks in autonomous agent-to-agent conversations

Assessing safety vulnerabilities when agents share information during collaboration

Benchmarking contextual safety across multi-turn interactions with embedded attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic benchmark for agent-agent interaction safety

Models multi-turn conversations with embedded attacks

Unifies privacy and security testing within interactive multi-agent conversations

🔎 Similar Papers

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?