Evaluating Prompt-Driven Chinese Large Language Models: The Influence of Persona Assignment on Stereotypes and Safeguards

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the culturally specific impact of persona (role) prompting on stereotyping and safety mitigation in Chinese large language models (LLMs), focusing on toxic content that Qwen generates about Chinese social groups. Method: a training-free, multi-model collaborative feedback framework in which Qwen generates responses, a fine-tuned BERT classifier detects toxicity, and an external evaluator iteratively refines outputs. Contribution/Results: empirical analysis reveals that adversarial persona prompts amplify toxicity toward Chinese social groups by up to 60×. The proposed method achieves a refusal rate above 92% while reducing toxicity by 78%, substantially improving cultural sensitivity and safety robustness. This work is the first to systematically uncover the bias-amplification mechanism induced by persona prompting in Chinese-language LLMs, and it provides a scalable, culture-adaptive safety-governance framework for non-Western LLMs.
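The summary describes a generate → classify → refine loop. A minimal sketch of that control flow, where `toxicity_score` and `revise` are illustrative stand-ins for the paper's fine-tuned BERT classifier and external evaluator (the threshold, round count, and refusal message are assumptions, not values from the paper):

```python
# Sketch of a training-free multi-model feedback loop: generate a draft,
# score it with a toxicity classifier, and refine until it passes or a
# refusal is returned. All components are illustrative stand-ins.

TOXICITY_THRESHOLD = 0.5  # assumed cutoff; the paper does not publish one
MAX_ROUNDS = 3            # assumed iteration budget

def toxicity_score(text: str) -> float:
    """Stand-in for the fine-tuned BERT toxicity classifier."""
    toxic_markers = {"insult", "slur", "hateful"}
    hits = sum(word in text.lower() for word in toxic_markers)
    return min(1.0, hits / 2)

def revise(text: str) -> str:
    """Stand-in for the external evaluator's refinement step."""
    for marker in ("insult", "slur", "hateful"):
        text = text.lower().replace(marker, "[removed]")
    return text

def feedback_loop(draft: str) -> str:
    """Iteratively re-check and refine a draft until it passes the classifier."""
    response = draft
    for _ in range(MAX_ROUNDS):
        if toxicity_score(response) < TOXICITY_THRESHOLD:
            return response
        response = revise(response)
    # If refinement never converges, fall back to an explicit refusal.
    return "I cannot provide that response."
```

The key design point is that the safeguard sits outside the generator, so no retraining of Qwen is needed; only the classifier is fine-tuned.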

📝 Abstract
Recent research has highlighted that assigning specific personas to large language models (LLMs) can significantly increase harmful content generation. Yet limited attention has been given to persona-driven toxicity in non-Western contexts, particularly in Chinese LLMs. In this paper, we perform a large-scale, systematic analysis of how persona assignment influences refusal behavior and response toxicity in Qwen, a widely used Chinese language model. Using fine-tuned BERT classifiers and regression analysis, our study reveals significant gender biases in refusal rates and demonstrates that certain negative personas can amplify toxicity toward Chinese social groups by up to 60-fold compared to the default model. To mitigate this toxicity, we propose a multi-model feedback strategy, employing iterative interactions between Qwen and an external evaluator, which effectively reduces toxic outputs without costly model retraining. Our findings emphasize the necessity of culturally specific analyses for LLM safety and offer a practical framework for evaluating and enhancing ethical alignment in LLM-generated content.
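Persona assignment of the kind studied here is typically implemented through the system prompt of a chat-format request. A minimal sketch of how such a probe can be constructed (the persona text and message schema are illustrative assumptions, not the paper's actual prompts):

```python
# Build a chat-format request that assigns a persona via the system prompt.
# The persona wording below is a hypothetical example, not from the paper.

def build_persona_prompt(persona: str, question: str) -> list[dict]:
    """Return OpenAI-style chat messages with the persona in the system turn."""
    return [
        {"role": "system", "content": f"You are {persona}. Stay in character."},
        {"role": "user", "content": question},
    ]

messages = build_persona_prompt(
    "a cynical online commentator",  # example adversarial persona
    "Describe this social group.",
)
```

Varying only the `persona` argument while holding the user question fixed is what lets refusal rates and toxicity be compared across personas against the default (no-persona) model.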
Problem

Research questions and friction points this paper is trying to address.

Evaluating persona-driven toxicity in Chinese LLMs
Analyzing gender biases and refusal behavior in Qwen
Proposing multi-model feedback to reduce toxic outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned BERT classifiers analyze toxicity
Multi-model feedback reduces toxic outputs
Regression analysis reveals gender biases
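The "up to 60-fold" figure is a ratio of toxic-response rates under a persona versus the default model. A minimal sketch of that computation (the counts below are invented for illustration):

```python
# Toxicity amplification factor: the toxic-response rate under an
# adversarial persona divided by the default model's rate.

def toxic_rate(n_toxic: int, n_total: int) -> float:
    """Fraction of sampled responses the classifier flags as toxic."""
    return n_toxic / n_total

def amplification(persona_toxic: int, persona_total: int,
                  base_toxic: int, base_total: int) -> float:
    """How many times more often the persona-prompted model is toxic."""
    return toxic_rate(persona_toxic, persona_total) / toxic_rate(base_toxic, base_total)

# Hypothetical counts: 0.1% toxic by default vs. 6% under a negative persona.
factor = amplification(60, 1000, 1, 1000)
print(round(factor))  # 60
```

Because the baseline rate sits in the denominator, a model that is almost never toxic by default can show a very large amplification factor even when the persona-prompted rate is modest in absolute terms.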
Geng Liu
The Greater Bay Area National Center of Technology Innovation
High-Performance Computing, CFD, LBM

Li Feng
Associate Professor of Radiology & Director of Rapid Imaging, NYU Grossman School of Medicine
Magnetic Resonance Imaging, Image Reconstruction

Carlo Alberto Bono
Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy

Songbo Yang
University of Science and Technology of China, China

Mengxiao Zhu
University of Science and Technology of China
Computational Social Science, Big Data Analysis, Social Network Analysis, Learning Analytics

Francesco Pierri
Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy