Benchmarking and Understanding Safety Risks in AI Character Platforms

📅 2025-11-30

🤖 AI Summary

AI character (role-playing) platforms face serious safety risks because of their immersive, personalized interactions and technical vulnerabilities, yet they have lacked systematic safety evaluation. Method: The authors conduct the first large-scale safety study of these platforms, evaluating 16 popular platforms against a benchmark of 5,000 questions spanning 16 safety categories. They further analyze how safety outcomes correlate with character features (e.g., demographics and personality) and train a machine learning model to identify less safe characters. Results: The platforms exhibit an average unsafe response rate of 65.1%, substantially higher than the 17.7% average rate of the baselines. Safety performance varies significantly across characters and is strongly correlated with character features, and the model identifies less safe characters with an F1-score of 0.81. Contributions: (1) a large-scale safety benchmark evaluation, (2) an analysis linking character features to safety outcomes, and (3) a predictive model that platforms could use for safer interactions, character search/recommendations, and character creation.

📝 Abstract

AI character platforms, which allow users to engage in conversations with AI personas, are a rapidly growing application domain. However, their immersive and personalized nature, combined with technical vulnerabilities, raises significant safety concerns. Despite their popularity, a systematic evaluation of their safety has been notably absent. To address this gap, we conduct the first large-scale safety study of AI character platforms, evaluating 16 popular platforms using a benchmark set of 5,000 questions across 16 safety categories. Our findings reveal a critical safety deficit: AI character platforms exhibit an average unsafe response rate of 65.1%, substantially higher than the 17.7% average rate of the baselines. We further discover that safety performance varies significantly across different characters and is strongly correlated with character features such as demographics and personality. Leveraging these insights, we demonstrate that our machine learning model is able to identify less safe characters with an F1-score of 0.81. This predictive capability can be beneficial for platforms, enabling improved mechanisms for safer interactions, character search/recommendations, and character creation. Overall, the results and findings offer valuable insights for enhancing platform governance and content moderation for safer AI character platforms.
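The two headline metrics in the abstract, the average unsafe response rate and the F1-score, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes each response has already been labeled safe or unsafe, that the reported average is a mean over per-platform rates (the abstract does not say whether rates are averaged per platform or pooled), and that "less safe" is the positive class for F1. All function names and the toy inputs in the usage note are hypothetical.

```python
from collections import defaultdict


def unsafe_rate_by_platform(records):
    """Return {platform: fraction of responses labeled unsafe}.

    `records` is an iterable of (platform, is_unsafe) pairs; the labels
    are assumed to come from some prior judging step not shown here.
    """
    counts = defaultdict(lambda: [0, 0])  # platform -> [unsafe, total]
    for platform, is_unsafe in records:
        counts[platform][0] += int(bool(is_unsafe))
        counts[platform][1] += 1
    return {p: unsafe / total for p, (unsafe, total) in counts.items()}


def mean_rate(rates):
    """Unweighted mean of per-platform rates (an assumed averaging scheme)."""
    return sum(rates.values()) / len(rates)


def f1_score(y_true, y_pred):
    """F1 for the positive class (here: a character flagged as less safe)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with the toy records `[("A", True), ("A", False), ("B", True), ("B", True)]`, platform A has a rate of 0.5 and platform B a rate of 1.0, so the unweighted mean is 0.75. Pooling all four responses gives the same value here only because both platforms have the same number of responses; with unequal counts the two schemes diverge.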

Problem

Research questions and friction points this paper is trying to address.

Evaluates safety risks in AI character platforms

Identifies high unsafe response rates across platforms

Develops model to predict unsafe character interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale safety study of AI character platforms

Machine learning model identifies less safe characters with an F1-score of 0.81

Benchmark evaluation across 16 safety categories using 5,000 questions
