KréyoLID From Language Identification Towards Language Mining

📅 2025-03-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

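Addresses automatic identification of low-resource languages by proposing a language mining perspective: a new pipeline that builds corpora quickly, applied to French-based Creoles.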

📝 Abstract

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora much faster and with better coverage than using established pipelines. To demonstrate the effectiveness of the language mining perspective, we introduce a new pipeline and corpora for several French-based Creoles.
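The mining idea in the abstract can be illustrated with a two-stage cascade: a very cheap test discards most documents, and only the survivors reach a costly classifier. The sketch below is an illustration under stated assumptions, not the pipeline introduced in the paper. The marker words, the thresholds, and the names `cheap_prefilter`, `toy_classifier`, and `mine` are all hypothetical.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Hypothetical marker words, chosen for illustration only (not from the paper).
MARKERS = frozenset({"mwen", "nou", "pou", "ak"})

TOKEN_RE = re.compile(r"\w+")


def cheap_prefilter(text: str, min_hits: int = 2) -> bool:
    """Return True once the text contains at least `min_hits` marker tokens."""
    hits = 0
    for match in TOKEN_RE.finditer(text.lower()):
        if match.group() in MARKERS:
            hits += 1
            if hits >= min_hits:
                return True  # stop scanning early
    return False


def toy_classifier(text: str) -> bool:
    """Stand-in for a costly model: markers must be >= 30% of all tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return False
    return sum(t in MARKERS for t in tokens) / len(tokens) >= 0.3


def mine(
    documents: Iterable[str],
    expensive_classifier: Callable[[str], bool],
    min_hits: int = 2,
) -> Tuple[List[str], int]:
    """Return the kept documents and the number of expensive classifier calls."""
    kept: List[str] = []
    expensive_calls = 0
    for doc in documents:
        if not cheap_prefilter(doc, min_hits):
            continue  # most documents are expected to stop here
        expensive_calls += 1
        if expensive_classifier(doc):
            kept.append(doc)
    return kept, expensive_calls


if __name__ == "__main__":
    docs = [
        "Mwen renmen manje ak zanmi mwen",
        "The quick brown fox jumps over the lazy dog",
        "Le chat dort sur le canapé",
        "The band Nou Pou released a new album this week in Paris",
    ]
    kept, calls = mine(docs, toy_classifier)
    print(f"kept {len(kept)} of {len(docs)} documents with {calls} expensive calls")
```

In this toy run the prefilter rejects two of the four documents outright, so the costly stage runs only twice. On a web-scale crawl where the target variety is rare, the same structure would skip the costly stage for the vast majority of documents.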

Problem

Research questions and friction points this paper is trying to address.

Reframing automatic language identification as a data mining problem

Creating digital corpora for less commonly written languages

Minimizing resources spent on classifying uninteresting documents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Treats language identification as data mining

Focuses on less commonly written languages

Introduces new pipeline for French-based Creoles
