Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether multilingual large language models exhibit consistent value expressions across languages when answering value-laden multiple-choice questions—manifesting as a unified “polyglot” or as distinct “monolinguals” per language. To this end, we introduce MEVS, the first human-translated, cross-lingually aligned multilingual value survey corpus covering eight European languages. Under rigorously controlled prompt conditions—including answer ordering and symbol types—we conduct a systematic evaluation of over 30 prominent large language models. Our findings reveal that, despite generally high cross-lingual value consistency among instruction-tuned models, their responses on certain items remain significantly influenced by the query language, thereby exposing notable limitations and complexities in current multilingual models’ value alignment.

📝 Abstract
Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

multilingual LLMs
value-laden questions
language-induced variation
multiple-choice questions
cross-lingual consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual LLMs
value-laden MCQs
human-translated corpus
cross-lingual consistency
preference fine-tuning
Léo Labat
Sorbonne Université, CNRS, ISIR, Paris, France
Etienne Ollion
CREST, CNRS, Institut Polytechnique de Paris, France
François Yvon
ISIR / CNRS et Sorbonne Université
Natural Language Processing · Speech Processing · Computational Linguistics · Machine Translation