SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

📅 2026-03-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the longstanding scarcity of historically deep, linguistically representative diachronic corpora for Sinhala, a low-resource language, a gap that has hindered natural language processing (NLP) research. We present the largest Sinhala diachronic corpus to date, comprising 244,000 words from 185 literary works published between 1800 and 1955, with a 59-document, 70,000-word subset annotated by written date. The approach combines copyright-compliant data collection, two-layer genre classification (a binary Fiction/Non-Fiction layer and a fine-grained layer), and practices adapted from corpora built for other languages. Text extraction uses Google Document AI OCR followed by multi-stage post-processing (layout restoration, code-mixing handling, and malformed-token correction), with normalization and annotation methodologies informed by FarPaHC and CCOHA. The resource extends the coverage and utility of SiDiaC-v.1.0 and provides a foundational dataset for Sinhala NLP.

📝 Abstract

SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering 1800 CE to 1955 CE in terms of publication dates and the 5th to the 20th century CE in terms of written dates. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts held by the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list and digitised using Google Document AI OCR. Post-processing then corrected formatting issues, addressed code-mixing, inserted special tokens, and fixed malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. These were particularly relevant for syntactic annotation and text normalisation: Faroese, the language of FarPaHC, shares Sinhala's low-resource status, and CCOHA applies similar cleaning strategies. The corpus is categorised by genre in two layers, primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite challenges arising from limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building on the work done in SiDiaC-v.1.0.
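
To make the two-layer genre scheme and the publication-date versus written-date distinction concrete, a per-document metadata record could be sketched roughly as follows. This is only an illustration: the class name, field names, and validation rules are hypothetical and are not taken from the corpus release. Only the genre labels and the 1800 to 1955 publication range come from the abstract, and the secondary genre list shown here is the illustrative subset the abstract names, not necessarily the full set.

```python
from dataclasses import dataclass
from typing import Optional

# Genre labels named in the abstract. The secondary list is only the
# subset mentioned there ("such as ..."); the real corpus may define more.
PRIMARY_GENRES = {"Fiction", "Non-Fiction"}
SECONDARY_GENRES = {"Religious", "History", "Poetry", "Language", "Medical"}


@dataclass
class DocumentRecord:
    """Hypothetical metadata record for one work (not the official schema)."""

    title: str
    publication_year: int               # year the edition was published
    primary_genre: str                  # binary layer
    secondary_genre: str                # fine-grained layer
    written_year: Optional[int] = None  # set only for the annotated subset

    def __post_init__(self) -> None:
        # Reject labels outside the two genre layers.
        if self.primary_genre not in PRIMARY_GENRES:
            raise ValueError(f"unknown primary genre: {self.primary_genre}")
        if self.secondary_genre not in SECONDARY_GENRES:
            raise ValueError(f"unknown secondary genre: {self.secondary_genre}")
        # The corpus covers publication dates from 1800 CE to 1955 CE.
        if not 1800 <= self.publication_year <= 1955:
            raise ValueError("publication year outside the range 1800-1955")


def has_written_date(record: DocumentRecord) -> bool:
    """Return True when the record carries a written-date annotation."""
    return record.written_year is not None
```

Keeping `written_year` optional mirrors the fact that only a subset of the documents carries written-date annotations, while every document has a publication year and both genre labels.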

Problem

Research questions and friction points this paper is trying to address.

Sinhala

diachronic corpus

low-resource language

NLP

historical texts

Innovation

Methods, ideas, or system contributions that make the work stand out.

diachronic corpus

low-resource NLP

text normalization