Do LLMs exhibit the same commonsense capabilities across languages?

📅 2025-09-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates how consistent large language models' (LLMs) commonsense generation capabilities are across languages. Method: We introduce MULTICOM, a benchmark for multilingual commonsense generation that extends the COCOTEROS dataset to English, Spanish, Dutch, and the low-resource language Valencian, and systematically evaluate prominent open-source models (LLaMA, Qwen, Gemma, EuroLLM, Salamandra) using a tripartite evaluation framework combining automatic metrics, LLM judges (Prometheus and JudgeLM), and human annotation, while quantifying the effect of contextual support per language. Contribution/Results: Results reveal strong performance disparities: models achieve the highest accuracy in English, with substantial degradation in less-resourced languages; contextual support yields mixed results but tends to benefit underrepresented languages. Our work identifies critical bottlenecks in current multilingual commonsense generation and establishes MULTICOM as a fine-grained benchmark for cross-lingual capability assessment.

📝 Abstract
This paper explores the multilingual commonsense generation abilities of Large Language Models (LLMs). To facilitate this investigation, we introduce MULTICOM, a novel benchmark that extends the COCOTEROS dataset to four languages: English, Spanish, Dutch, and Valencian. The task involves generating a commonsensical sentence that includes a given triplet of words. We evaluate a range of open-source LLMs, including LLaMA, Qwen, Gemma, EuroLLM, and Salamandra, on this benchmark. Our evaluation combines automatic metrics, LLM-as-a-judge approaches (using Prometheus and JudgeLM), and human annotations. Results consistently show superior performance in English, with significantly lower performance in less-resourced languages. While contextual support yields mixed results, it tends to benefit underrepresented languages. These findings underscore the current limitations of LLMs in multilingual commonsense generation. The dataset is publicly available at https://huggingface.co/datasets/gplsi/MULTICOM.
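The task described in the abstract (generating a commonsensical sentence that includes a given triplet of words) can be sketched as follows. The prompt template and the surface-level checker below are illustrative assumptions for exposition, not the paper's exact prompts or evaluation metrics.

```python
# Sketch of the MULTICOM-style task: given a word triplet, a model must
# generate one commonsensical sentence containing all three words.
# Prompt wording and the checker are assumptions, not the paper's setup.

def build_prompt(triplet, language="English"):
    """Build a generation prompt asking for a sentence with all three words."""
    return (
        f"Write one natural, commonsensical sentence in {language} "
        f"that contains all of these words: {', '.join(triplet)}."
    )

def contains_triplet(sentence, triplet):
    """Surface check only: does the output contain every triplet word?"""
    tokens = [t.strip(".,;:!?") for t in sentence.lower().split()]
    return all(word.lower() in tokens for word in triplet)

triplet = ("dog", "ball", "park")
prompt = build_prompt(triplet, language="Spanish")
candidate = "The dog chased the ball across the park."
print(contains_triplet(candidate, triplet))  # True
```

Note that word inclusion is only a necessary condition; judging whether the sentence is actually commonsensical is what the paper's automatic metrics, LLM judges, and human annotators assess.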
Problem

Research questions and friction points this paper is trying to address.

Evaluating multilingual commonsense generation in LLMs
Assessing performance gaps across resource-rich and low-resource languages
Testing contextual support effectiveness for underrepresented languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual benchmark extension for commonsense evaluation
Combined automatic and human evaluation methodologies
Contextual support analysis for underrepresented languages