Evalita-LLM: Benchmarking Large Language Models on Italian

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
The absence of fair and robust evaluation benchmarks for Italian large language models (LLMs) hinders reliable performance assessment and progress tracking. Method: We introduce Evalita-LLM, an evaluation benchmark built entirely from native Italian tasks that combines well-established multiple-choice (discriminative) tasks with generative ones. To mitigate prompt sensitivity and cultural bias, every task is evaluated against multiple prompts, and candidate tasks and prompts are validated iteratively against a set of LLMs used for development. Contribution/Results: Experiments from the development phase show that this methodology improves assessment stability and cross-model comparability. We publicly release the benchmark data and systematic performance comparisons across 12 diverse tasks for leading state-of-the-art LLMs, establishing a rigorous, reproducible evaluation infrastructure for the Italian LLM research community.
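
As an illustration of the multi-prompt evaluation idea, the minimal sketch below aggregates a model's score over several prompt templates, so the headline number reflects average behavior rather than a single lucky prompt. Everything here is hypothetical scaffolding (StubModel, the answer() method, exact-match scoring), not the benchmark's actual interface.

```python
from statistics import mean, stdev

class StubModel:
    """Toy stand-in for an LLM client; always answers 'A' (hypothetical)."""
    def answer(self, prompt: str) -> str:
        return "A"

def accuracy(model, template: str, examples) -> float:
    """Exact-match accuracy of one (model, prompt template) pair on a task."""
    hits = sum(model.answer(template.format(**ex)) == ex["label"] for ex in examples)
    return hits / len(examples)

def multi_prompt_score(model, templates, examples) -> dict:
    """Aggregate over prompts so the headline score is less prompt-dependent."""
    scores = [accuracy(model, t, examples) for t in templates]
    return {"mean": mean(scores),      # headline score across prompts
            "stdev": stdev(scores),    # the model's prompt sensitivity
            "per_prompt": scores}

examples = [{"question": "2 + 2 = ?", "label": "A"},
            {"question": "Capitale d'Italia?", "label": "B"}]
templates = ["Domanda: {question}\nRisposta:",
             "Rispondi alla domanda: {question}"]
print(multi_prompt_score(StubModel(), templates, examples))
```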

📝 Abstract
We describe Evalita-LLM, a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing and innovative features of Evalita-LLM are the following: (i) all tasks are native Italian, avoiding the artifacts of translation into Italian and potential cultural biases; (ii) in addition to well-established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, mitigating model sensitivity to specific prompts and allowing a fairer, more objective evaluation. We propose an iterative methodology in which candidate tasks and candidate prompts are validated against a set of LLMs used for development. We report experimental results from the benchmark's development phase and provide performance statistics for several state-of-the-art LLMs.
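
One plausible reading of the iterative validation step is sketched below: the prompt whose score profile over the development models deviates most from the consensus of the remaining prompts is pruned, until the surviving set is consistent. The tolerance threshold and consensus definition are assumptions, and the accuracy() helper is reused from the earlier sketch; the paper's actual protocol may differ.

```python
from statistics import mean

def validate_prompts(dev_models, templates, examples, tolerance=0.10):
    """Iteratively drop the prompt whose per-model scores deviate most from
    the consensus of the remaining prompts (names/thresholds are assumed)."""
    kept = list(templates)
    while len(kept) > 1:
        # score profile of each surviving prompt over the development models
        profiles = {t: [accuracy(m, t, examples) for m in dev_models] for t in kept}
        # consensus: mean score per model across the surviving prompts
        consensus = [mean(col) for col in zip(*profiles.values())]
        # worst prompt = largest deviation of its profile from the consensus
        deviation = {t: max(abs(a - b) for a, b in zip(p, consensus))
                     for t, p in profiles.items()}
        worst = max(deviation, key=deviation.get)
        if deviation[worst] <= tolerance:  # all prompts consistent: stop
            break
        kept.remove(worst)
    return kept
```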
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on Italian-language tasks fairly and reproducibly
Avoiding the translation artifacts and cultural biases of benchmarks translated into Italian
Covering both multiple-choice and generative tasks while mitigating sensitivity to individual prompts (see the scoring sketch below)
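
For context, this is how evaluation harnesses commonly score the two task styles the benchmark combines; the loglikelihood() and generate() model methods are assumed here, and the benchmark's own scoring may differ.

```python
def score_multiple_choice(model, question: str, options: list[str], gold: int) -> bool:
    """Discriminative: the model 'chooses' the option it assigns the highest
    log-likelihood (loglikelihood() is an assumed model method)."""
    lls = [model.loglikelihood(question, option) for option in options]
    return lls.index(max(lls)) == gold

def score_generative(model, prompt: str, references: list[str]) -> bool:
    """Generative: a free-form answer is checked against reference strings
    (generate() is an assumed model method; real metrics may be softer)."""
    answer = model.generate(prompt).strip().lower()
    return any(answer == ref.strip().lower() for ref in references)
```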
Innovation

Methods, ideas, or system contributions that make the work stand out.

All tasks are native Italian
Generative tasks included alongside standard multiple-choice tasks
Every task evaluated against multiple prompts, with candidate tasks and prompts validated iteratively on development LLMs