Evaluating Prompt-Driven Chinese Large Language Models: The Influence of Persona Assignment on Stereotypes and Safeguards

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the culturally specific impact of persona (role) prompting on stereotyping and safety mitigation in Chinese large language models (LLMs), focusing on toxic content that Qwen generates about Chinese social groups. Method: a training-free, multi-model collaborative feedback framework in which Qwen generates responses, a fine-tuned BERT classifier detects toxicity, and an external evaluator iteratively refines outputs. Contribution/Results: empirical analysis reveals that adversarial persona prompts amplify toxicity toward Chinese social groups by up to 60×. The proposed method achieves a refusal rate above 92% while reducing toxicity by 78%, substantially improving cultural sensitivity and safety robustness. This work is the first to systematically uncover the bias-amplification mechanism induced by persona prompting in Chinese-language LLMs, and it provides a scalable, culture-adaptive safety-governance framework for non-Western LLMs.
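The summary describes a generate → classify → refine loop. A minimal sketch of that control flow, where `toxicity_score` and `revise` are illustrative stand-ins for the paper's fine-tuned BERT classifier and external evaluator (the threshold, round count, and refusal message are assumptions, not values from the paper):

```python
# Sketch of a training-free multi-model feedback loop: generate a draft,
# score it with a toxicity classifier, and refine until it passes or a
# refusal is returned. All components are illustrative stand-ins.

TOXICITY_THRESHOLD = 0.5  # assumed cutoff; the paper does not publish one
MAX_ROUNDS = 3            # assumed iteration budget

def toxicity_score(text: str) -> float:
    """Stand-in for the fine-tuned BERT toxicity classifier."""
    toxic_markers = {"insult", "slur", "hateful"}
    hits = sum(word in text.lower() for word in toxic_markers)
    return min(1.0, hits / 2)

def revise(text: str) -> str:
    """Stand-in for the external evaluator's refinement step."""
    for marker in ("insult", "slur", "hateful"):
        text = text.lower().replace(marker, "[removed]")
    return text

def feedback_loop(draft: str) -> str:
    """Iteratively re-check and refine a draft until it passes the classifier."""
    response = draft
    for _ in range(MAX_ROUNDS):
        if toxicity_score(response) < TOXICITY_THRESHOLD:
            return response
        response = revise(response)
    # If refinement never converges, fall back to an explicit refusal.
    return "I cannot provide that response."
```

The key design point is that the safeguard sits outside the generator, so no retraining of Qwen is needed; only the classifier is fine-tuned.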

📝 Abstract
Recent research has highlighted that assigning specific personas to large language models (LLMs) can significantly increase harmful content generation. Yet limited attention has been given to persona-driven toxicity in non-Western contexts, particularly in Chinese LLMs. In this paper, we perform a large-scale, systematic analysis of how persona assignment influences refusal behavior and response toxicity in Qwen, a widely used Chinese language model. Using fine-tuned BERT classifiers and regression analysis, our study reveals significant gender biases in refusal rates and demonstrates that certain negative personas can amplify toxicity toward Chinese social groups by up to 60-fold compared to the default model. To mitigate this toxicity, we propose a multi-model feedback strategy, employing iterative interactions between Qwen and an external evaluator, which effectively reduces toxic outputs without costly model retraining. Our findings emphasize the necessity of culturally specific analyses for LLM safety and offer a practical framework for evaluating and enhancing ethical alignment in LLM-generated content.
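Persona assignment of the kind studied here is typically implemented through the system prompt of a chat-format request. A minimal sketch of how such a probe can be constructed (the persona text and message schema are illustrative assumptions, not the paper's actual prompts):

```python
# Build a chat-format request that assigns a persona via the system prompt.
# The persona wording below is a hypothetical example, not from the paper.

def build_persona_prompt(persona: str, question: str) -> list[dict]:
    """Return OpenAI-style chat messages with the persona in the system turn."""
    return [
        {"role": "system", "content": f"You are {persona}. Stay in character."},
        {"role": "user", "content": question},
    ]

messages = build_persona_prompt(
    "a cynical online commentator",  # example adversarial persona
    "Describe this social group.",
)
```

Varying only the `persona` argument while holding the user question fixed is what lets refusal rates and toxicity be compared across personas against the default (no-persona) model.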
Problem

Research questions and friction points this paper is trying to address.

Evaluating persona-driven toxicity in Chinese LLMs
Analyzing gender biases and refusal behavior in Qwen
Proposing multi-model feedback to reduce toxic outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned BERT classifiers analyze toxicity
Multi-model feedback reduces toxic outputs
Regression analysis reveals gender biases
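The "up to 60-fold" figure is a ratio of toxic-response rates under a persona versus the default model. A minimal sketch of that computation (the counts below are invented for illustration):

```python
# Toxicity amplification factor: the toxic-response rate under an
# adversarial persona divided by the default model's rate.

def toxic_rate(n_toxic: int, n_total: int) -> float:
    """Fraction of sampled responses the classifier flags as toxic."""
    return n_toxic / n_total

def amplification(persona_toxic: int, persona_total: int,
                  base_toxic: int, base_total: int) -> float:
    """How many times more often the persona-prompted model is toxic."""
    return toxic_rate(persona_toxic, persona_total) / toxic_rate(base_toxic, base_total)

# Hypothetical counts: 0.1% toxic by default vs. 6% under a negative persona.
factor = amplification(60, 1000, 1, 1000)
print(round(factor))  # 60
```

Because the baseline rate sits in the denominator, a model that is almost never toxic by default can show a very large amplification factor even when the persona-prompted rate is modest in absolute terms.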
Geng Liu
The Greater Bay Area National Center of Technology Innovation
High-Performance Computing, CFD, LBM

Li Feng
Associate Professor of Radiology & Director of Rapid Imaging, NYU Grossman School of Medicine
Magnetic Resonance Imaging, Image Reconstruction

Carlo Alberto Bono
Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy

Songbo Yang
University of Science and Technology of China, China

Mengxiao Zhu
University of Science and Technology of China
Computational Social Science, Big Data Analysis, Social Network Analysis, Learning Analytics

Francesco Pierri
Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy