SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the longstanding scarcity of historically deep, linguistically representative diachronic corpora for Sinhala, a low-resource language, a gap that has hindered natural language processing (NLP) research. It presents the largest Sinhala diachronic corpus to date: 244,000 words from 185 literary works published between 1800 and 1955, with a 70,000-word subset of 59 documents annotated by written date. The approach combines copyright-compliant data collection with a two-layer genre classification (a binary Fiction/Non-Fiction split plus fine-grained categories) and draws on practices from corpora built for other languages. Texts were digitised with Google Document AI OCR and then post-processed in several stages (correcting formatting issues, handling code-mixing, inserting special tokens, and fixing malformed tokens), with normalization and annotation strategies informed by FarPaHC and CCOHA. The resource extends the coverage of SiDiaC-v.1.0 and provides a foundational dataset for Sinhala NLP.

📝 Abstract
SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering publication dates from 1800 CE to 1955 CE and written dates spanning the 5th to the 20th century CE. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list and digitised using Google Document AI OCR. This was followed by post-processing to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. This was particularly relevant for syntactic annotation and text normalisation strategies, given the low-resource status that Sinhala shares with Faroese and the similar cleaning strategies utilised in CCOHA. The corpus is categorised by genre at two levels: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite challenges posed by limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.
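The two-layer genre scheme and the publication-date range described in the abstract can be illustrated with a small metadata record. This is a minimal sketch only: the class name `CorpusDocument`, its field names, the whitespace-based word count, and the helper `words_by_genre` are assumptions made for illustration and are not taken from the SiDiaC-v.2.0 release, whose actual file format and tokenisation are not described here.

```python
from collections import Counter
from dataclasses import dataclass

# The abstract states that the primary layer is binary.
PRIMARY_GENRES = {"Fiction", "Non-Fiction"}


@dataclass(frozen=True)
class CorpusDocument:
    """Hypothetical record for one literary work (field names are assumed)."""

    title: str
    publication_year: int   # the abstract gives a range of 1800 to 1955 CE
    primary_genre: str      # "Fiction" or "Non-Fiction"
    secondary_genre: str    # e.g. "Religious", "History", "Poetry"
    text: str

    def __post_init__(self) -> None:
        # Reject records that fall outside the scheme described in the abstract.
        if self.primary_genre not in PRIMARY_GENRES:
            raise ValueError(f"unknown primary genre: {self.primary_genre!r}")
        if not 1800 <= self.publication_year <= 1955:
            raise ValueError(f"publication year out of range: {self.publication_year}")

    @property
    def word_count(self) -> int:
        # Assumption: words are whitespace-separated tokens.
        return len(self.text.split())


def words_by_genre(docs):
    """Return word totals per primary genre and per secondary genre."""
    primary, secondary = Counter(), Counter()
    for doc in docs:
        primary[doc.primary_genre] += doc.word_count
        secondary[doc.secondary_genre] += doc.word_count
    return primary, secondary
```

Summing `word_count` over all records in such a structure is one way the corpus-level totals (244k words, 185 works) could be checked against the per-genre breakdown.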
Problem

Research questions and friction points this paper is trying to address.

Sinhala
diachronic corpus
low-resource language
NLP
historical texts
Innovation

Methods, ideas, or system contributions that make the work stand out.

diachronic corpus
low-resource NLP
text normalization
genre classification
OCR post-processing
Nevidu Jayatilleke
University of Moratuwa, Sri Lanka
Computational Linguistics, Artificial Intelligence, Machine Learning
Nisansa de Silva
Senior Lecturer, Department of Computer Science & Engineering, University of Moratuwa
Natural Language Processing, Artificial Intelligence, Machine Learning
Uthpala Nimanthi
Research Department, Informatics Institute of Technology, Sri Lanka
Gagani Kulathilaka
Research Department, Informatics Institute of Technology, Sri Lanka
Azra Safrullah
Research Department, Informatics Institute of Technology, Sri Lanka
Johan Sofalas
Research Department, Informatics Institute of Technology, Sri Lanka