SumTablets: A Transliteration Dataset of Sumerian Tablets

📅 2026-02-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of a comprehensive, accessible dataset pairing Sumerian transliterations with a digital representation of the underlying cuneiform glyphs, a gap that has kept modern natural language processing (NLP) methods from being applied to Sumerian transliteration. The authors introduce SumTablets, a dataset that pairs Unicode representations of 91,606 cuneiform tablets (6,970,407 glyphs in total) with their transliterations published by Oracc. Parallel structural information such as surfaces, line breaks, and broken segments is preserved through special tokens. The dataset is released on Hugging Face under CC BY 4.0, with the data preparation code available on GitHub. Two baselines are evaluated: weighted sampling from each glyph's possible readings, and a fine-tuned autoregressive language model. The fine-tuned model reaches an average character-level F-score (chrF) of 97.55, which suggests that experts could verify generated transliterations rather than transliterate each tablet manually.
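To make the reported metric concrete, the following is a minimal sketch of a character n-gram F-score in the spirit of chrF. It is a simplified illustration, not the evaluation code used by the authors: the function names `char_ngrams` and `chrf`, the choice to strip whitespace, and the defaults (n-gram orders 1 to 6, beta = 2) are assumptions made here for the example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace (an assumption of this sketch)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision and recall, combined as an F-beta score (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders for which either string is too short
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Toy strings in transliteration style; not taken from the dataset.
print(chrf("lugal e2 gal", "lugal e2 gal"))
print(chrf("lugal e3 gal", "lugal e2 gal"))
```

With beta = 2, recall is weighted more heavily than precision, so a hypothesis that drops characters is penalized more than one that adds a few extra.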

📝 Abstract

Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet's cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and standardizing the Oracc transliterations before mapping each reading back to the Unicode representation of the source glyph. Further, we retain parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We release SumTablets as a Hugging Face Dataset (CC BY 4.0) and open-source data preparation code via GitHub. Additionally, we leverage SumTablets to implement and evaluate two transliteration baselines: (1) weighted sampling from a glyph's possible readings, and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the immediate potential of transformer-based transliteration models in allowing experts to rapidly verify generated transliterations rather than manually transliterating tablets one by one.
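The first baseline, weighted sampling from a glyph's possible readings, can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the authors' implementation: the helper names `build_reading_table` and `sample_transliteration`, the `<unk>` placeholder for unseen glyphs, and the toy glyph-reading pairs are all hypothetical, and the weights are taken to be the observed frequency of each reading for a given glyph.

```python
import random
from collections import Counter, defaultdict


def build_reading_table(pairs):
    """Count how often each reading is observed for each glyph."""
    table = defaultdict(Counter)
    for glyph, reading in pairs:
        table[glyph][reading] += 1
    return table


def sample_transliteration(glyphs, table, rng):
    """Pick one reading per glyph, with probability proportional to its observed count."""
    output = []
    for glyph in glyphs:
        counts = table.get(glyph)
        if not counts:
            output.append("<unk>")  # hypothetical placeholder for unseen glyphs
            continue
        readings = list(counts)
        weights = [counts[r] for r in readings]
        output.append(rng.choices(readings, weights=weights, k=1)[0])
    return output


# Toy aligned (glyph, reading) pairs for illustration only; not drawn from SumTablets.
toy_pairs = [
    ("𒀭", "an"), ("𒀭", "an"), ("𒀭", "dingir"),
    ("𒈗", "lugal"),
    ("𒂍", "e2"),
]
table = build_reading_table(toy_pairs)
rng = random.Random(0)  # fixed seed so repeated runs sample the same readings
result = sample_transliteration(["𒈗", "𒀭", "𒂍"], table, rng)
print(" ".join(result))
```

Because each glyph is sampled independently of its neighbors, a baseline of this kind cannot use context to choose among readings of a polyvalent glyph, which is the kind of information a fine-tuned autoregressive model can exploit.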

Problem

Research questions and friction points this paper is trying to address.

Sumerian transliteration

cuneiform glyphs

NLP

dataset

digital Assyriology

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sumerian transliteration

cuneiform

aligned dataset