Multi-label Scandinavian Language Identification (SLIDE)

📅 2025-02-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the challenge of fine-grained, sentence-level multilabel language identification (LID) for Scandinavian languages—Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish—where sentences frequently exhibit intra-sentential code-mixing, rendering single-label classification inadequate. To tackle this, we introduce SLIDE, the first manually annotated multilabel evaluation dataset for Scandinavian LID. We propose a lightweight, multilayer neural architecture leveraging character- and token-level features, integrated with threshold optimization and label-correlation modeling to enable tunable precision–efficiency trade-offs. Experiments demonstrate that multilabel modeling is essential for accurate LID: our method achieves a mean F1-score of 89.2% on SLIDE, substantially outperforming single-label baselines. The lightweight variant processes over 10,000 sentences per second, satisfying both industrial deployment constraints and academic evaluation rigor.

📝 Abstract

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
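
To make the multi-label framing concrete, the following is a minimal sketch of how a multi-label decision differs from a single-label one. It is not the paper's model or training approach: the language codes, the logit values, the 0.5 thresholds, and the fallback to the single most probable language are all assumptions chosen for illustration. Each language gets an independent sigmoid score, and every language whose score clears its threshold is returned, so a sentence that is valid in both Danish and Bokmål can receive both labels.

```python
import math

# Assumed label set: Danish, Norwegian Bokmål, Norwegian Nynorsk, Swedish.
LANGS = ["da", "nb", "nn", "sv"]


def sigmoid(x: float) -> float:
    """Map a raw score to an independent per-language probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, thresholds):
    """Return every language whose probability clears its own threshold.

    Unlike softmax + argmax, the scores are not forced to compete,
    so several languages can be predicted for one sentence.
    """
    probs = [sigmoid(z) for z in logits]
    chosen = [lang for lang, p, t in zip(LANGS, probs, thresholds) if p >= t]
    if not chosen:
        # Assumption for this sketch: never return an empty set,
        # fall back to the single most probable language.
        best = max(range(len(LANGS)), key=lambda i: probs[i])
        chosen = [LANGS[best]]
    return chosen


# Hypothetical logits for a sentence that reads as both Danish and Bokmål.
logits = [2.1, 1.7, -0.4, -3.0]
thresholds = [0.5, 0.5, 0.5, 0.5]
print(predict_labels(logits, thresholds))  # ['da', 'nb']
```

Raising a language's threshold trades recall for precision on that label, which is one simple way such a classifier can be tuned per language.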

Problem

Research questions and friction points this paper is trying to address.

Identify multiple Scandinavian languages simultaneously.

Develop multi-label language identification models.

Create a dataset for evaluating language identification.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-label language identification

Manually curated dataset

Novel training approach
