Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

📅 2026-02-16
🤖 AI Summary
This study evaluates large language models on an endangered language with non-standard orthography, using Piedmontese as a case study. The authors crowdsource a parallel corpus of 145 Italian–Piedmontese sentence pairs derived from Flores+, preserving native speakers’ natural writing styles and adding manual word alignments. They use the dataset for three benchmark tasks: tokenization parity, topic classification, and machine translation. Results indicate that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet topic classification performance approaches that of Italian, French, and English. Machine translation is asymmetric: translation from Piedmontese into high-resource languages is adequate, whereas generation into Piedmontese remains challenging. The work offers a structured evaluation resource for a non-standardized endangered language and sheds light on the capabilities and limitations of large language models in low-resource settings.

📝 Abstract
We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
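
The tokenization penalty mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common formulation of tokenization parity: the ratio of token counts between a target-language sentence and its reference-language translation, averaged over parallel pairs. The function names, the toy tokenizer, and the placeholder strings below are hypothetical; they are not the authors' code, and the strings are not taken from the dataset. In practice the `tokenize` argument would be the tokenizer of the model under test.

```python
from statistics import mean
from typing import Callable, Iterable, Sequence, Tuple

Tokenizer = Callable[[str], Sequence[str]]


def tokenization_parity(pairs: Iterable[Tuple[str, str]], tokenize: Tokenizer) -> float:
    """Mean ratio of target to reference token counts over parallel pairs.

    A value above 1.0 means the target language needs more tokens than
    the reference language to express the same content.
    """
    ratios = []
    for reference, target in pairs:
        ref_len = len(tokenize(reference))
        if ref_len == 0:
            continue  # skip pairs whose reference side yields no tokens
        ratios.append(len(tokenize(target)) / ref_len)
    if not ratios:
        raise ValueError("no usable sentence pairs")
    return mean(ratios)


def toy_tokenize(text: str, max_len: int = 4) -> list:
    """Stand-in tokenizer (an assumption, not a real LLM tokenizer):
    split on whitespace, then cut each word into chunks of max_len characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return tokens


# Placeholder (reference, target) strings, not real Italian or Piedmontese.
example_pairs = [
    ("aaaa bbbb", "aaaaaa bbbb cc"),
    ("abcd", "abcd"),
]
parity = tokenization_parity(example_pairs, toy_tokenize)
```

Averaging per-sentence ratios weights every sentence equally; dividing total target tokens by total reference tokens is an alternative that weights longer sentences more heavily.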
Problem

Research questions and friction points this paper is trying to address.

non-standard orthography
Piedmontese
large language models
machine translation
endangered languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

crowdsourcing
non-standard orthography
low-resource language
large language models
machine translation
Authors
Gianluca Vico
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Jindřich Libovický
Charles University
natural language processing, multilinguality, neural machine translation, language and vision