Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation

📅 2024-12-18
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Multilingual large language models (LLMs) express markedly higher bias and toxicity when prompted in non-English languages. Method: The authors apply supervised fine-tuning (SFT) on curated non-harmful text and direct preference optimization (DPO) using English data only, then evaluate how the resulting safety improvements generalize across target languages. Results: Fine-tuning on non-harmful text reduces bias, but only DPO reliably mitigates toxicity; both mitigations applied in English transfer to other languages. The amount of a given language's data in the model's pretraining corpus predicts the strength of this transfer, but the transfer often comes at the cost of degraded fluency and lexical diversity in non-English generations. These findings suggest that English-only safety tuning is insufficient for multilingual deployment and motivate language-specific bias and toxicity mitigation methods.

๐Ÿ“ Abstract
Recent generative large language models (LLMs) show remarkable performance in non-English languages, but when prompted in those languages they tend to express higher harmful social biases and toxicity levels. Prior work has shown that finetuning on specialized datasets can mitigate this behavior, and doing so in English can transfer to other languages. In this work, we investigate the impact of different finetuning methods on the model's bias and toxicity, but also on its ability to produce fluent and diverse text. We reduce biases by finetuning on curated non-harmful text, but find only direct preference optimization to be effective for mitigating toxicity. The mitigation caused by applying these methods in English also transfers to non-English languages. We find evidence that the extent to which transfer takes place can be predicted by the amount of data in a given language present in the model's pretraining data. However, this transfer of bias and toxicity mitigation often comes at the expense of decreased language generation ability in non-English languages, highlighting the importance of developing language-specific bias and toxicity mitigation methods.
Problem

Research questions and friction points this paper is trying to address.

Mitigate harmful social biases
Reduce toxicity in multilingual LLMs
Preserve fluent text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetuning on non-harmful text
Direct preference optimization
Cross-lingual transfer mitigation
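
Direct preference optimization, the one method the paper finds effective against toxicity, trains the policy directly on preference pairs (e.g., a non-toxic response preferred over a toxic one) without a separate reward model. A minimal sketch of the standard DPO loss for a single pair follows; the function name, inputs, and `beta` value are illustrative assumptions, not the paper's implementation:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Inputs are the summed log-probabilities of the chosen (preferred,
    e.g., non-toxic) and rejected (e.g., toxic) responses under the
    trainable policy and a frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # beta scales how strongly the policy may deviate from the reference.
    logits = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(logits)): small when the policy prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference does, which is how English preference pairs can shift generation behavior that then transfers, imperfectly, to other languages.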