🤖 AI Summary
This study evaluates how large language models handle an endangered language with non-standard orthography, using Piedmontese as a case study. The authors crowdsource a parallel corpus of 145 Italian–Piedmontese sentence pairs derived from Flores+, preserving native speakers' natural writing styles and adding manual word alignments. On this dataset they run three benchmark tasks: tokenization parity, topic classification, and machine translation. Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet topic classification performance approaches that of Italian, French, and English. Translation is asymmetric: models translate adequately from Piedmontese into high-resource languages, whereas generation into Piedmontese remains challenging. The work provides a structured benchmark for a non-standardized endangered language and sheds light on the capabilities and limitations of large language models in low-resource settings.
📝 Abstract
We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian–Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, together with manual word alignments. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
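The tokenization penalty mentioned above can be made concrete with a small sketch. The abstract does not give the exact metric, so this assumes one common formulation: the ratio of total token counts over parallel sentences, target language divided by reference language. The names `toy_tokenizer` and `tokenization_parity` are hypothetical, and the toy tokenizer is only a stand-in for a real model's subword tokenizer.

```python
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List[str]]

# Hypothetical vocabulary for the toy tokenizer; a real LLM vocabulary is learned.
TOY_VOCAB = frozenset({"il", "gatto", "dorme"})


def toy_tokenizer(text: str) -> List[str]:
    """Stand-in for a subword tokenizer (an assumption, not a real model's).

    Words found in TOY_VOCAB become one token; any other word falls back
    to one token per character, mimicking how rare spellings fragment.
    """
    tokens: List[str] = []
    for word in text.lower().split():
        if word in TOY_VOCAB:
            tokens.append(word)
        else:
            tokens.extend(word)  # one token per character
    return tokens


def tokenization_parity(
    pairs: Iterable[Tuple[str, str]], tokenize: Tokenizer
) -> float:
    """Ratio of target-language to reference-language token counts.

    `pairs` holds (reference, target) parallel sentences. A value above 1.0
    means the target language needs more tokens for the same content.
    """
    ref_total = 0
    tgt_total = 0
    for reference, target in pairs:
        ref_total += len(tokenize(reference))
        tgt_total += len(tokenize(target))
    if ref_total == 0:
        raise ValueError("reference side produced no tokens")
    return tgt_total / ref_total


if __name__ == "__main__":
    # Illustrative pair only, not taken from the dataset.
    sample = [("il gatto dorme", "ël gat a deurm")]
    print(f"parity: {tokenization_parity(sample, toy_tokenizer):.2f}")
```

To measure a real model, one would pass that model's own tokenizer as `tokenize` and feed in the 145 sentence pairs. Summing counts over the corpus before dividing, rather than averaging per-sentence ratios, keeps very short sentences from dominating the result.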