What if I ask in *alia lingua*? Measuring Functional Similarity Across Languages

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of evaluating functional consistency—i.e., whether large language models (LLMs) produce semantically equivalent outputs across languages. We propose κₚ, a novel metric grounded in response function equivalence, enabling systematic multilingual self-consistency assessment. Applying κₚ to the GlobalMMLU benchmark, we conduct the first large-scale analysis across 20 languages and 47 academic disciplines. Our results show: (1) cross-lingual self-consistency improves significantly with model parameter count; (2) a given model exhibits higher cross-lingual consistency than the inter-model consensus within the same language; and (3) κₚ effectively discriminates between model capabilities, offering an interpretable, task-agnostic benchmark for multilingual reliability. This study uncovers an intrinsic link between linguistic generalization and model scale, providing both theoretical insights and practical tools for optimizing consistency in multilingual LLM systems.

📝 Abstract
How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $κ_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $κ_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
Problem

Research questions and friction points this paper is trying to address.

Measuring functional similarity across languages using model outputs
Assessing cross-lingual consistency in model responses across 20 languages
Evaluating multilingual reliability and consistency in AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applied the κ_p similarity metric to measure cross-lingual functional similarity of model outputs
Analyzed cross-lingual consistency across 20 languages and 47 GlobalMMLU subjects
Showed that cross-lingual consistency increases with model size and capability
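The κ_p metric used in the paper is a chance-adjusted agreement score between two sets of model responses. As an illustration only, the sketch below computes a simpler Cohen-kappa-style agreement between one model's answers to the same multiple-choice questions asked in two languages; the paper's κ_p additionally adjusts the expected-agreement term for each model's accuracy, which this stand-in omits. The function name and the toy answer lists are hypothetical.

```python
from collections import Counter

def chance_adjusted_agreement(answers_a, answers_b):
    """Cohen-kappa-style chance-adjusted agreement between two answer lists.

    Illustrative stand-in for the paper's kappa_p: kappa_p also corrects the
    expected-agreement term for model accuracy, which is omitted here.
    """
    assert len(answers_a) == len(answers_b) and answers_a
    n = len(answers_a)
    # Observed agreement: fraction of items where the two answers match.
    c_obs = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    # Expected agreement under independence, from each list's marginals.
    pa, pb = Counter(answers_a), Counter(answers_b)
    c_exp = sum((pa[k] / n) * (pb[k] / n) for k in pa.keys() | pb.keys())
    return (c_obs - c_exp) / (1 - c_exp) if c_exp < 1 else 1.0

# Hypothetical example: one model's answers to the same questions in English
# and Hindi; identical answers would give an agreement of 1.0.
en = ["A", "B", "C", "D", "A", "B"]
hi = ["A", "B", "C", "A", "A", "B"]
print(chance_adjusted_agreement(en, hi))  # → 0.76
```

Chance adjustment matters here because two models that both answer "A" frequently would show high raw agreement even if their responses were unrelated; subtracting the expected agreement removes that inflation.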
👥 Authors
Debangan Mishra (IIIT Hyderabad)
Arihant Rastogi (IIIT Hyderabad)
Agyeya Negi (IIIT Hyderabad)
Shashwat Goel (ELLIS, Max Planck Institute for Intelligent Systems Tübingen; topics: Evaluations, Science of Deep Learning, Scaling Supervision, AI Safety)
Ponnurangam Kumaraguru (IIIT Hyderabad)