🤖 AI Summary
This work addresses a failure mode of multilingual large language models: even when a user persona is fixed, the language of the prompt can override it, so the same British persona receives answers tied to different national cultures depending on the prompt language. To mitigate this, the authors propose C-3PO (Cross-lingual Cultural Consistent Preference Optimisation), a consensus-driven alignment framework that enforces cross-lingual cultural consistency so that the model preserves the specified persona across languages. They introduce Singleton Fleiss's κ_S, a metric for quantifying cross-lingual cultural inconsistency that is designed to be resilient to hallucinations. Experiments show that C-3PO improves κ_S by up to 0.10 absolute over unaligned models and outperforms strong prompting and representation-steering baselines. The inconsistency disproportionately affects lower-resource languages such as Indonesian and Persian, and a layer-wise analysis that early-decodes intermediate layers indicates that models drift towards the prompt language's stereotypical culture as forward-pass representations stabilise.
📝 Abstract
Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt's language changes. While such adaptation is generally desirable, it becomes a critical failure when a user's identity is explicitly defined. For instance, given a fixed British persona and an ambiguous everyday knowledge query about literature, the prompt's language frequently overwrites the system persona -- yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this Cross-lingual Cultural Inconsistency, we introduce Singleton Fleiss's $\kappa_S$, a metric mathematically resilient to hallucinations. For mitigation, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO achieves up to a 0.10-point absolute increase in $\kappa_S$ over unaligned models, outperforming strong prompting and representation steering baselines. Empirical evaluations show this inconsistency disproportionately affects lower-resource languages like Indonesian and Persian. A layer-wise interpretability analysis reveals the underlying mechanism: by early-decoding intermediate layer representations, we find that MLLMs implicitly personalise outputs towards the prompt language's stereotypical culture as forward-pass representations stabilise.
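To make the agreement metric more concrete, the sketch below computes the standard Fleiss's $\kappa$, treating each query as an item, each prompt language as a rater, and the culture reflected in the answer as the category. This is only the textbook statistic: the Singleton variant $\kappa_S$ is not defined in the abstract, so its hallucination handling is not reproduced here. The function name `fleiss_kappa` and the culture labels are illustrative assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Standard Fleiss's kappa (NOT the paper's Singleton variant).

    ratings: list of items; each item is a list of category labels,
    one label per rater. Here a rater is assumed to be a prompt
    language and a label the culture reflected in the answer.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_chance = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_chance == 1.0:
        # Only one category ever appears; kappa is undefined, and this
        # sketch reports perfect agreement by convention.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical labels for two queries answered in three prompt languages.
consistent = [["UK", "UK", "UK"], ["ES", "ES", "ES"]]
language_driven = [["UK", "ES", "ID"], ["UK", "ES", "ID"]]
kappa_consistent = fleiss_kappa(consistent)
kappa_language_driven = fleiss_kappa(language_driven)
```

In this toy setup, answers that agree across languages for each query give a high $\kappa$, while answers that follow the prompt language rather than the persona give a low (here negative) one.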