🤖 AI Summary
This paper addresses the challenge of fine-grained, sentence-level multilabel language identification (LID) for Scandinavian languages—Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish—where sentences frequently exhibit intra-sentential code-mixing, rendering single-label classification inadequate. To tackle this, we introduce SLIDE, the first manually annotated multilabel evaluation dataset for Scandinavian LID. We propose a lightweight, multilayer neural architecture leveraging character- and token-level features, integrated with threshold optimization and label-correlation modeling to enable tunable precision–efficiency trade-offs. Experiments demonstrate that multilabel modeling is essential for accurate LID: our method achieves a mean F1-score of 89.2% on SLIDE, substantially outperforming single-label baselines. The lightweight variant processes over 10,000 sentences per second, satisfying both industrial deployment constraints and academic evaluation rigor.
📝 Abstract
Identifying closely related languages at the sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present Scandinavian Language Identification and Evaluation (SLIDE), a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
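To make the multi-label framing concrete, the following is a minimal sketch of how a set of languages could be read off per-language scores, and why a single-label decision cannot match a multi-label gold annotation. It is not the paper's training approach or model: the language codes, the `predict_labels` and `exact_match` helpers, the 0.5 threshold, and the example scores are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: names, threshold, and scores are assumptions,
# not taken from the paper.
LANGUAGES = ("da", "nb", "nn", "sv")  # Danish, Bokmål, Nynorsk, Swedish


def predict_labels(scores, threshold=0.5):
    """Return every language whose score meets the threshold (possibly several)."""
    labels = {lang for lang in LANGUAGES if scores.get(lang, 0.0) >= threshold}
    # Fall back to the single best language so a sentence is never left unlabeled.
    if not labels:
        labels = {max(LANGUAGES, key=lambda lang: scores.get(lang, 0.0))}
    return labels


def exact_match(predicted, gold):
    """A prediction counts as correct only if the label sets are identical."""
    return predicted == gold


# Hypothetical scores for a sentence assumed to be valid in both Danish and Bokmål.
scores = {"da": 0.81, "nb": 0.77, "nn": 0.12, "sv": 0.03}
gold = {"da", "nb"}

multi = predict_labels(scores)           # multi-label decision
single = {max(scores, key=scores.get)}   # single-label (argmax) decision

print(exact_match(multi, gold), exact_match(single, gold))
```

Under these assumed scores, the argmax decision keeps only one of the two valid languages, so it cannot match the gold label set, while the thresholded decision can.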