Challenging the Abilities of Large Language Models in Italian: a Community Initiative

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Systematic evaluation of large language models (LLMs) for non-English languages, particularly Italian, suffers from fragmented benchmarks, inconsistent frameworks, and a lack of comprehensive, standardized assessment protocols. Method: We introduce the most comprehensive evaluation benchmark for Italian to date, covering over 20 tasks spanning language understanding, reasoning, translation, and more. We propose a rolling, community-driven evaluation paradigm, coupled with a unified assessment framework and a fine-grained metric taxonomy; an automated pipeline integrates heterogeneous datasets and supports multi-task, multi-metric evaluation. Contribution/Results: We conduct the first comprehensive capability analysis of four open-weight LLMs on Italian, revealing previously undocumented cross-task performance disparities. All benchmark datasets, evaluation tooling, and results are fully open-sourced, establishing a reproducible, sustainable, and openly shared methodological foundation for evaluating LLMs in languages beyond English.

📝 Abstract
The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
Problem

Research questions and friction points this paper is trying to address.

Evaluates Italian language models' linguistic and reasoning abilities
Addresses lack of systematic non-English LLM benchmarking
Creates a community-driven framework for continuous model assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated community collaboration for diverse task design
Centralized evaluation pipeline supporting heterogeneous datasets
Rolling benchmark framework for continuous integration
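The innovations above can be sketched in miniature: a task registry that contributors extend over time (the rolling benchmark), with per-task metrics accommodating heterogeneous datasets, and a single harness that scores any model across all registered tasks. This is a minimal illustrative sketch, not the actual CALAMITA pipeline; all names (`Task`, `register_task`, `evaluate`) and the toy data are hypothetical.

```python
# Hedged sketch of a rolling, community-driven evaluation registry.
# All identifiers and data here are illustrative, not CALAMITA's real API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    name: str                              # e.g. "summarization-it"
    examples: List[dict]                   # heterogeneous per-task data
    metric: Callable[[str, dict], float]   # task-specific scoring function

REGISTRY: Dict[str, Task] = {}

def register_task(task: Task) -> None:
    """Contributors add new tasks at any time (rolling benchmark)."""
    REGISTRY[task.name] = task

def evaluate(model: Callable[[str], str]) -> Dict[str, float]:
    """Run one model across every registered task; one mean score per task."""
    report = {}
    for task in REGISTRY.values():
        scores = [task.metric(model(ex["input"]), ex) for ex in task.examples]
        report[task.name] = sum(scores) / len(scores)
    return report

# Toy usage: an exact-match metric on a two-example task.
register_task(Task(
    name="toy-qa-it",
    examples=[{"input": "2+2?", "target": "4"},
              {"input": "3+3?", "target": "6"}],
    metric=lambda pred, ex: float(pred == ex["target"]),
))
echo_model = lambda prompt: "4"   # trivial stand-in for an LLM
print(evaluate(echo_model))       # {'toy-qa-it': 0.5}
```

The design choice mirrored here is that each task owns its metric, so a centralized harness can aggregate tasks with incompatible data formats and scoring schemes without special-casing any of them.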
Malvina Nissim
Professor of Computational Linguistics and Society, Rijksuniversiteit Groningen
Computational Linguistics, Language Technology, Natural Language Processing, Linguistics, Digital Humanities
Danilo Croce
University of Rome Tor Vergata
Viviana Patti
Associate Professor of Computer Science, Università di Torino, Dipartimento di Informatica
Artificial Intelligence, Natural Language Processing, Irony Detection, Sentiment Analysis, Social Semantic Web
Pierpaolo Basile
Associate Professor, University of Bari Aldo Moro
Natural Language Processing, Information Retrieval, Artificial Intelligence, Semantics, Computational Linguistics
Giuseppe Attanasio
Postdoctoral Researcher, Instituto de Telecomunicações
AI, Fairness, Transparency, Safety
Elio Musacchio
University of Bari Aldo Moro
Matteo Rinaldi
University of Turin
Federico Borazio
University of Rome Tor Vergata
Maria Francis
University of Groningen
Jacopo Gili
University of Turin
Daniel Scalena
Joint PhD @ University of Milano - Bicocca, University of Groningen
Artificial Intelligence, Natural Language Processing, Interpretability
Begoña Altuna
Universidad del País Vasco / Euskal Herriko Unibertsitatea (University of the Basque Country)
Natural Language Processing
Ekhi Azurmendi
University of the Basque Country (UPV/EHU)
Valerio Basile
University of Turin
Data Perspectivism, Language Resources, Computational Semantics, Natural Language Processing
Luisa Bentivogli
Head of the Machine Translation Group at Fondazione Bruno Kessler, Italy
Machine Translation, Speech Translation, Evaluation, Gender Inclusive MT
Arianna Bisazza
Associate Professor, University of Groningen
Natural Language Processing, Multilingual NLP, Interpretability, Language Learning in Humans vs Machines
Marianna Bolognesi
University of Bologna
Dominique Brunato
ILC-CNR
Tommaso Caselli
Assistant Professor, Faculty of Arts, Rijksuniversiteit Groningen
Lexical Semantics, Lexical Resources, Temporal Processing, Sentiment Analysis, Event Detection
Silvia Casola
LMU
Natural Language Processing, Machine Learning
Maria Cassese
ISTI-CNR
Mauro Cettolo
Researcher at Fondazione Bruno Kessler, Trento (Italy)
Natural Language Processing, Statistical Machine Translation, Automatic Speech Recognition
Claudia Collacciani
University of Bologna
Leonardo De Cosmo
ANSA
Maria Pia Di Buono
University of Naples "L’Orientale"