🤖 AI Summary
This work addresses the lack of evaluation frameworks for the conceptual understanding of large language models (LLMs) in multilingual settings, particularly for low-resource and morphologically complex languages, by introducing XCOMPS, the first cross-lingual conceptual minimal-pair benchmark, covering 17 languages. Methodologically, it proposes a cross-lingual conceptual minimal-pair paradigm that integrates metalinguistic prompting, direct probability measurement, neurolinguistic probing, and multi-model comparison (base, instruction-tuned, and knowledge-distilled variants). Key contributions include: (1) the first empirical demonstration of significant cross-lingual imbalance in LLMs' conceptual understanding; (2) identification of a decoupling effect, whereby instruction tuning improves explicit task performance but not intrinsic conceptual competence, whereas knowledge distillation enhances the latter without substantial gains on explicit tasks; (3) confirmation that morphological complexity correlates positively with the network depth required for conceptual reasoning; and (4) identification of semantically similar distractors as a critical challenge. Results further show that knowledge distillation improves intrinsic conceptual ability in low-resource languages, albeit with limited benefits on explicit tasks.
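As a rough illustration of the minimal-pair, direct-probability-measurement setup described above, the sketch below scores an acceptable sentence and its minimally different counterpart under a causal LM and counts the item as correct when the acceptable sentence receives the higher summed log-probability. The model name and the example pair are placeholders for illustration, not items from XCOMPS or details of the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint would work here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Summed log-probability of a sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # Each position predicts the next token; the first token is left unscored.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_scores = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_scores.sum().item()

def prefers_acceptable(acceptable: str, unacceptable: str) -> bool:
    """True if the model assigns higher probability to the acceptable sentence."""
    return sentence_logprob(acceptable) > sentence_logprob(unacceptable)

# Hypothetical concept-property minimal pair (illustrative, not from XCOMPS).
print(prefers_acceptable("A knife can cut bread.", "A spoon can cut bread."))
```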
📝 Abstract
In this work, we introduce XCOMPS, a multilingual conceptual minimal-pair dataset covering 17 languages. Using this dataset, we evaluate LLMs' multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages even when the models are tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but show a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves explicit performance on concept-understanding tasks but does not enhance internal conceptual competence; knowledge distillation can enhance internal conceptual competence for low-resource languages, with limited gains in explicit task performance. 4) More morphologically complex languages yield lower concept-understanding scores and require deeper layers for conceptual reasoning.
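To make the neurolinguistic-probing idea concrete, here is a minimal sketch, not the authors' implementation, that fits a logistic-regression probe on each layer's final-token hidden state and reports per-layer accuracy; the depth at which the probe begins to separate acceptable from unacceptable sentences gives a rough notion of the layers required, which finding 4 relates to morphological complexity. The model name, the sentence pairs, and the probe setup are illustrative assumptions.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM exposing hidden states works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def layerwise_last_token_states(sentence: str) -> list[np.ndarray]:
    """Hidden state of the final token at every layer (embeddings + each block)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        hidden_states = model(ids).hidden_states  # tuple of [1, seq, dim] tensors
    return [h[0, -1].float().numpy() for h in hidden_states]

# Illustrative placeholder pairs; a real run would iterate over one language's
# XCOMPS items (acceptable sentence, minimally different unacceptable sentence).
pairs = [
    ("A knife can cut bread.", "A spoon can cut bread."),
    ("A sponge can absorb water.", "A stone can absorb water."),
    ("A kettle can boil water.", "A basket can boil water."),
]

features, labels = [], []
for acceptable, unacceptable in pairs:
    features.append(layerwise_last_token_states(acceptable)); labels.append(1)
    features.append(layerwise_last_token_states(unacceptable)); labels.append(0)

labels = np.array(labels)
for layer in range(len(features[0])):
    X = np.stack([f[layer] for f in features])
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, X, labels, cv=3).mean()
    print(f"layer {layer:2d}: probe accuracy {acc:.2f}")
```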