Multimodal Safety Evaluation in Generative Agent Social Simulations

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generative agents exhibit critical deficiencies in safety, behavioral consistency, and social trustworthiness within multimodal environments, suffering in particular from weak cross-modal safety reasoning, behavioral instability, and low social acceptability. Method: We introduce a reproducible generative agent social simulation framework integrating hierarchical memory, dynamic planning, and multimodal perception, and conduct text-vision co-simulation using Claude, GPT-4o mini, and Qwen-VL. Contribution/Results: We propose SocialMetrics, a novel evaluation suite quantifying plan revision rate, unsafe-to-safe behavior conversion rate, and information diffusion intensity, establishing a three-tier safety assessment: temporal improvement, risk detection, and social acceptability. Experiments reveal only 55% global safety-alignment success; average per-model unsafe-to-safe conversion rates range from 55% to 75% (and from 20% to 98% across individual scenarios), while 45% of unsafe behaviors are misclassified as acceptable when paired with misleading visual inputs.
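The three SocialMetrics quantities are named here but their formulas are not given on this page, so the sketch below shows one plausible operationalization in Python: revision rate counted per plan step, conversion judged on the final state of initially unsafe steps, and diffusion intensity as population reach. The `PlanRecord` schema, function names, and formulas are assumptions for illustration, not the paper's published definitions.

```python
from dataclasses import dataclass

@dataclass
class PlanRecord:
    """One agent plan step observed during a simulation run (hypothetical schema)."""
    agent_id: str
    revised: bool            # was the step rewritten after feedback?
    initially_unsafe: bool   # flagged unsafe on first evaluation
    final_safe: bool         # judged safe after the last revision

def plan_revision_rate(records):
    """Share of plan steps that were revised at least once."""
    return sum(r.revised for r in records) / len(records)

def unsafe_to_safe_rate(records):
    """Share of initially unsafe steps whose final version is judged safe."""
    unsafe = [r for r in records if r.initially_unsafe]
    return sum(r.final_safe for r in unsafe) / len(unsafe) if unsafe else 0.0

def diffusion_intensity(exposures, n_agents):
    """Fraction of the agent population each information item reaches;
    `exposures` maps an item to the set of agent ids that encountered it."""
    return {item: len(agents) / n_agents for item, agents in exposures.items()}

# Toy run: 2/3 steps revised, 1/2 unsafe steps converted, rumor reached 2/3 agents.
records = [
    PlanRecord("a1", revised=True,  initially_unsafe=True,  final_safe=True),
    PlanRecord("a2", revised=False, initially_unsafe=True,  final_safe=False),
    PlanRecord("a3", revised=True,  initially_unsafe=False, final_safe=True),
]
print(plan_revision_rate(records))                               # 0.666...
print(unsafe_to_safe_rate(records))                              # 0.5
print(diffusion_intensity({"rumor": {"a1", "a2"}}, n_agents=3))  # {'rumor': 0.666...}
```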

📝 Abstract
Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, and multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models (Claude, GPT-4o mini, and Qwen-VL), five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.
Problem

Research questions and friction points this paper is trying to address.

Evaluating safety and trust in multimodal generative agents
Assessing agents' ability to detect unsafe social activities
Measuring social dynamics and safety alignment in simulations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulation framework evaluates multimodal agent safety
Agents use layered memory and dynamic planning (see the sketch after this list)
SocialMetrics quantifies plan revisions and safety conversions
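The page does not specify how the layered memory and dynamic planner interact, so the following is a minimal, hypothetical Python sketch of the general pattern: a two-tier memory that promotes salient observations, and a replanning step triggered when recalled context looks unsafe. `LayeredMemory`, `SimAgent`, the `_looks_unsafe` heuristic, and the `model` callable are illustrative stand-ins, not components from the paper.

```python
from collections import deque

class LayeredMemory:
    """Two-tier memory: a bounded short-term buffer plus a long-term store
    for promoted events. A simplified stand-in for hierarchical memory."""
    def __init__(self, short_capacity=20):
        self.short_term = deque(maxlen=short_capacity)
        self.long_term = []

    def remember(self, observation, important=False):
        self.short_term.append(observation)
        if important:
            self.long_term.append(observation)  # promote salient events

    def recall(self, k=5):
        # Recent context plus everything promoted to long-term storage.
        return list(self.short_term)[-k:] + self.long_term

class SimAgent:
    def __init__(self, name, model):
        self.name = name
        self.model = model        # any callable wrapping an LLM/VLM endpoint
        self.memory = LayeredMemory()
        self.plan = "walk to the kitchen and start cooking"

    def perceive(self, text, image=None):
        obs = {"text": text, "image": image}
        self.memory.remember(obs, important=self._looks_unsafe(obs))

    def _looks_unsafe(self, obs):
        # Toy keyword heuristic; the real framework would query the model.
        return any(w in obs["text"].lower() for w in ("fire", "smoke", "gas leak"))

    def replan(self):
        """Dynamic planning: regenerate the plan whenever recalled context
        contains an unsafe observation."""
        context = self.memory.recall()
        if any(self._looks_unsafe(o) for o in context):
            self.plan = self.model(f"Revise this plan safely given {context}: {self.plan}")
        return self.plan

# Toy usage with a stubbed model.
agent = SimAgent("a1", model=lambda prompt: "evacuate and call for help")
agent.perceive("smoke is filling the kitchen")
print(agent.replan())  # -> "evacuate and call for help"
```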