🤖 AI Summary
Generative agents exhibit critical deficiencies in safety, behavioral consistency, and social trustworthiness in multimodal environments, particularly weak cross-modal safety reasoning, behavioral instability, and low social acceptability. Method: We introduce a reproducible generative-agent social simulation framework that integrates hierarchical memory, dynamic planning, and multimodal perception, and we conduct text–vision co-simulations with Claude, GPT-4o mini, and Qwen-VL. Contribution/Results: We propose SocialMetrics, an evaluation suite quantifying plan revision rate, unsafe-to-safe behavior conversion rate, and information diffusion intensity, establishing a three-tier safety assessment: temporal improvement, risk detection, and social acceptability. Experiments reveal only a 55% global safety alignment success rate; while unsafe-to-safe conversion ranges from 20% to 98% across scenarios, 45% of unsafe behaviors are misclassified as acceptable due to misleading visual inputs.
📝 Abstract
Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, and multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models (Claude, GPT-4o mini, and Qwen-VL), five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.
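To make the three metric families concrete, here is a minimal sketch of how plan revision rate, unsafe-to-safe conversion rate, and acceptance ratio could be computed from simulation logs. The metric names follow the abstract, but the `PlanStep` data model and the exact formulas are illustrative assumptions, not the authors' SocialMetrics implementation.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    revised: bool        # was this plan step rewritten at least once?
    unsafe_before: bool  # flagged unsafe before revision?
    safe_after: bool     # judged safe after the final revision?

def plan_revision_rate(steps: list[PlanStep]) -> float:
    """Fraction of plan steps that were revised at least once."""
    return sum(s.revised for s in steps) / len(steps)

def unsafe_to_safe_rate(steps: list[PlanStep]) -> float:
    """Of the steps initially flagged unsafe, the fraction corrected to safe."""
    unsafe = [s for s in steps if s.unsafe_before]
    if not unsafe:
        return 1.0  # nothing to correct
    return sum(s.safe_after for s in unsafe) / len(unsafe)

def acceptance_ratio(accepted: int, proposed: int) -> float:
    """Share of proposed social exchanges that other agents accepted."""
    return accepted / proposed if proposed else 0.0

# Example: four logged plan steps from one simulated agent (toy data).
log = [
    PlanStep(revised=True,  unsafe_before=True,  safe_after=True),
    PlanStep(revised=False, unsafe_before=True,  safe_after=False),
    PlanStep(revised=True,  unsafe_before=False, safe_after=False),
    PlanStep(revised=False, unsafe_before=False, safe_after=False),
]
print(plan_revision_rate(log))      # 0.5: two of four steps revised
print(unsafe_to_safe_rate(log))     # 0.5: one of two unsafe steps fixed
print(acceptance_ratio(9, 20))      # 0.45: 45 percent of exchanges accepted
```

Information diffusion intensity is omitted here because it depends on the (unspecified) network structure of the simulation; under the same logging assumptions it would reduce to counting newly informed agents per timestep.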