Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

📅 2026-02-16

🤖 AI Summary

This study evaluates large language models on an endangered language with non-standard orthography, using Piedmontese, a Romance language of northwestern Italy, as a case study. The authors crowdsource a parallel corpus of 145 Italian–Piedmontese sentence pairs derived from Flores+, preserving each speaker's natural spelling rather than enforcing a standard orthography, and add manual word alignments. They use the dataset for three benchmark tasks: tokenization parity, topic classification, and machine translation. Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet topic classification performance approaches that of Italian, French, and English. Machine translation is asymmetric: translation from Piedmontese into high-resource languages is adequate, whereas generation into Piedmontese remains challenging. The dataset and code are publicly released.

📝 Abstract

We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
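
The abstract does not spell out how tokenization parity is computed. A common reading, assumed here, is the ratio of the token count for a set of sentences in one language to the token count for their parallel translations in a reference language, so that a value above 1 indicates a tokenization penalty. The sketch below illustrates that assumption only: `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer, and `tokenization_parity` is an illustrative helper, not the authors' code.

```python
from typing import Callable, Sequence


def toy_tokenize(text: str, chunk: int = 4) -> list[str]:
    """Hypothetical stand-in for a subword tokenizer.

    Splits on whitespace, then cuts each word into pieces of at most
    `chunk` characters. A real model tokenizer would be used in practice.
    """
    pieces: list[str] = []
    for word in text.split():
        pieces.extend(word[i:i + chunk] for i in range(0, len(word), chunk))
    return pieces


def tokenization_parity(
    target: Sequence[str],
    reference: Sequence[str],
    tokenize: Callable[[str], Sequence[str]],
) -> float:
    """Corpus-level token-count ratio of target over reference.

    Assumes target[i] and reference[i] are translations of each other.
    A result above 1.0 means the target language needs more tokens for
    the same content.
    """
    if len(target) != len(reference):
        raise ValueError("target and reference must be parallel (same length)")
    target_tokens = sum(len(tokenize(s)) for s in target)
    reference_tokens = sum(len(tokenize(s)) for s in reference)
    if reference_tokens == 0:
        raise ValueError("reference side produced no tokens")
    return target_tokens / reference_tokens


# Placeholder strings, not real Piedmontese or Italian sentences.
parity = tokenization_parity(["abcdefgh ij"], ["abcd ef"], toy_tokenize)
print(parity)
```

To apply this to an actual model, pass that model's own tokenizer function in place of `toy_tokenize`; the ratio is tokenizer-specific, so it has to be recomputed for each model being compared.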

Problem

Research questions and friction points this paper is trying to address.

non-standard orthography

Piedmontese

large language models

machine translation

endangered languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

crowdsourcing

non-standard orthography

low-resource language