KréyoLID: From Language Identification Towards Language Mining

📅 2025-03-09
🤖 AI Summary
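Addresses automatic language identification for low-resource languages by proposing a language mining perspective, in which a new pipeline builds corpora quickly; applied to French-based Creoles.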

📝 Abstract
Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora much faster and with better coverage than using established pipelines. To demonstrate the effectiveness of the language mining perspective, we introduce a new pipeline and corpora for several French-based Creoles.
Problem

Research questions and friction points this paper is trying to address.

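Automatic language identification is usually framed as multi-class classification, which fits corpus creation poorly
Creating digital corpora for less commonly written languages
Minimizing resources spent on classifying documents that are of little interest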
Innovation

Methods, ideas, or system contributions that make the work stand out.

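Treats language identification as a data mining problem rather than a classification problem
Focuses on less commonly written languages, where most documents are known in advance to be irrelevant
Introduces a new pipeline and corpora for several French-based Creoles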
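
The mining idea above (spend as little as possible on documents that are almost certainly irrelevant) can be illustrated with a two-stage cascade. This is a minimal sketch under stated assumptions, not the paper's pipeline: the marker list, the `min_hits` threshold, and the names `cheap_prefilter` and `mine` are hypothetical, and `expensive_classifier` stands in for whatever costlier language identifier one would actually use.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical marker list: a handful of common Haitian Creole function
# words, chosen only for illustration. A real list would need curation.
MARKERS = frozenset({"mwen", "nou", "ak", "nan", "pou"})


def cheap_prefilter(text: str, min_hits: int = 2) -> bool:
    """Return True if the text contains at least `min_hits` distinct markers."""
    tokens = set(text.lower().split())
    return len(tokens & MARKERS) >= min_hits


def mine(
    docs: Iterable[str],
    expensive_classifier: Callable[[str], bool],
) -> Iterator[str]:
    """Yield documents that pass the cheap filter and then the costly check.

    Documents rejected by the prefilter never reach the expensive
    classifier, which is where the resource savings come from.
    """
    for doc in docs:
        if not cheap_prefilter(doc):
            continue  # the vast majority of documents should stop here
        if expensive_classifier(doc):
            yield doc
```

With a toy input of one Creole-looking sentence and two unrelated ones, only the first sentence would reach `expensive_classifier`. The trade-off is that a prefilter that is too strict loses recall, so in practice the threshold would have to be tuned against held-out data.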