Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates EngGPT2-16B-A3B, a 16-billion-parameter sparse mixture-of-experts (MoE) model with 3 billion active parameters developed by ENGINEERING, to assess how a native Italian large language model performs on multilingual and specialized tasks. The model is benchmarked on international and Italian datasets, including ARC, GSM8K, MMLU, HumanEval, ITALIC, and RULER, and compared against prominent open-source dense and MoE models. Among the Italian models evaluated, EngGPT2-16B-A3B achieves the best result on the 32k-context RULER setting and the highest aggregate performance, although Velvet-14B outperforms it on ITALIC. In aggregate it still trails some of the most performant international models, in particular GPT-5 nano and Qwen3-8B.

📝 Abstract

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B-parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks and compared against comparably sized open-source MoE and dense models. Compared with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well as or better than them on the international benchmarks ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). Among these models, it also achieves the best performance in the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well as or better than the other models, except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower values on BFCL and some ARC and ITALIC settings. Finally, it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8B-Instruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings show the new model to be a step forward for native Italian Large Language Models.
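The abstract does not say how scores are aggregated across benchmark metrics. As a rough illustration only, the sketch below assumes an unweighted mean over metrics that are already on a common 0 to 1 scale. The model names, benchmark names, and scores are hypothetical placeholders, not results from the report.

```python
from statistics import mean

# Hypothetical placeholder scores in [0, 1]; NOT results from the report.
scores = {
    "model_a": {"bench_1": 0.60, "bench_2": 0.40, "bench_3": 0.80},
    "model_b": {"bench_1": 0.50, "bench_2": 0.70, "bench_3": 0.30},
}


def aggregate(per_benchmark):
    """Unweighted mean over benchmark metrics (an assumed aggregation rule)."""
    return mean(per_benchmark.values())


def rank_models(all_scores):
    """Return (model, aggregate score) pairs sorted from best to worst."""
    totals = {name: aggregate(vals) for name, vals in all_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_models(scores)
for name, total in ranking:
    print(f"{name}: {total:.3f}")
```

An unweighted mean treats every metric as equally important. A weighted mean or a mean of per-benchmark ranks could order the models differently, so any aggregate comparison should state which rule it uses.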

Problem

Research questions and friction points this paper is trying to address.

benchmarking

large language models

Mixture of Experts

Italian LLMs

model evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts (MoE)

Italian LLM

long-context evaluation