🤖 AI Summary
This work addresses the fragmentation of research on and applications for Turkic languages caused by the lack of unified natural language processing (NLP) resources. To bridge this gap, the authors present TurkicNLP, an open-source Python library that provides integrated NLP support across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The system uses a modular architecture that combines finite-state transducers with neural network models to support automatic script detection, tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional transliteration, cross-lingual sentence embeddings, and machine translation. All outputs conform to the CoNLL-U standard to ensure interoperability and extensibility. The code and documentation are open source to encourage community adoption and further development.
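To make "automatic script detection" concrete, the following is a minimal sketch that assigns a text to one of the four script families by counting letters per Unicode block. It does not use TurkicNLP's own API, which is not described here; the names `SCRIPT_RANGES` and `detect_script` are hypothetical, and a majority vote over code points is only one plausible heuristic.

```python
from collections import Counter
from typing import Optional

# Unicode block ranges (inclusive) for each script family.
# The grouping of blocks is an assumption made for this sketch.
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x024F)],
    "Cyrillic": [(0x0400, 0x052F)],
    "Perso-Arabic": [
        (0x0600, 0x06FF),  # Arabic
        (0x0750, 0x077F),  # Arabic Supplement
        (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
        (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
    ],
    "Old Turkic": [(0x10C00, 0x10C4F)],
}


def detect_script(text: str) -> Optional[str]:
    """Return the script family with the most letters in `text`, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        cp = ord(ch)
        for script, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[script] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["Türkçe bir cümle", "Қазақ тілі", "ئۇيغۇر تىلى"]:
        print(sample, "->", detect_script(sample))
```

A router built on such a function could pick a script-specific tokenizer or transliterator. Mixed-script input would need a finer-grained policy than a single majority vote.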
📝 Abstract
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp.
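For readers unfamiliar with CoNLL-U, the sketch below serializes one analyzed sentence into the ten tab-separated columns the format defines (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The `Token` class and `to_conllu` function are hypothetical helpers written for illustration, not part of TurkicNLP, and the annotation of the Turkish sample sentence is simplified.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One CoNLL-U word line; '_' marks an unspecified field."""
    id: int
    form: str
    lemma: str = "_"
    upos: str = "_"
    xpos: str = "_"
    feats: str = "_"
    head: int = 0
    deprel: str = "_"
    deps: str = "_"
    misc: str = "_"


def to_conllu(text: str, tokens: List[Token]) -> str:
    """Render a sentence as CoNLL-U: comment line, token lines, blank line."""
    lines = [f"# text = {text}"]
    for t in tokens:
        fields = [
            str(t.id), t.form, t.lemma, t.upos, t.xpos,
            t.feats, str(t.head), t.deprel, t.deps, t.misc,
        ]
        lines.append("\t".join(fields))
    # A blank line terminates every sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


# Illustrative (simplified) analysis of Turkish "Ben geldim." ("I came.")
sentence = [
    Token(1, "Ben", "ben", "PRON", head=2, deprel="nsubj"),
    Token(2, "geldim", "gel", "VERB", head=0, deprel="root"),
    Token(3, ".", ".", "PUNCT", head=2, deprel="punct"),
]
conllu_text = to_conllu("Ben geldim.", sentence)

if __name__ == "__main__":
    print(conllu_text, end="")
```

Because each token line carries the same ten fields regardless of language or script, output in this shape can be consumed by existing Universal Dependencies tooling without per-language adapters.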