HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language Transfer and Automatic Data Annotation

📅 2024-12-13
🤖 AI Summary
To address data scarcity in Norwegian dialectal NLU, this work tackles the three NorSID shared-task problems: intent detection, slot filling, and dialect identification. For intent detection and slot filling, a multitask model is fine-tuned in a cross-lingual setting to leverage the xSID dataset, which is available in 17 languages. For dialect identification, the submitted model is fine-tuned directly on the provided development set, the best-performing configuration in the authors' experiments. Test-set results do not drop relative to the development set, which the authors attribute to the domain-specificity of the dataset and the similar distribution of the two subsets. The paper also provides an in-depth analysis of the datasets and their artifacts, reports on less successful experiments, and examines why some methods outperformed others, chiefly the impact of the language combination and the domain-specificity of the training data.

📝 Abstract
In this paper we present our submission for the NorSID Shared Task as part of the 2025 VarDial Workshop (Scherrer et al., 2025), consisting of three tasks: Intent Detection, Slot Filling and Dialect Identification, evaluated using data in different dialects of the Norwegian language. For Intent Detection and Slot Filling, we have fine-tuned a multitask model in a cross-lingual setting, to leverage the xSID dataset available in 17 languages. In the case of Dialect Identification, our final submission consists of a model fine-tuned on the provided development set, which has obtained the highest scores within our experiments. Our final results on the test set show that our models do not drop in performance compared to the development set, likely due to the domain-specificity of the dataset and the similar distribution of both subsets. Finally, we also report an in-depth analysis of the provided datasets and their artifacts, as well as other sets of experiments that have been carried out but did not yield the best results. Additionally, we present an analysis on the reasons why some methods have been more successful than others; mainly the impact of the combination of languages and domain-specificity of the training data on the results.
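The abstract mentions fine-tuning a multitask model on the xSID dataset, which pairs each utterance with a sentence-level intent label and token-level BIO slot tags. The sketch below parses one example in a CoNLL-like layout; the exact file format (comment headers such as `# text:` and `# intent:` followed by tab-separated token/tag lines) is an assumption about how xSID is distributed, not something stated in this page.

```python
# Hedged sketch: parse one xSID-style example (intent + BIO slot tags).
# The "# text:" / "# intent:" headers and token<TAB>tag lines are an
# assumed CoNLL-like layout, not a documented xSID specification.

def parse_example(block: str):
    """Parse one sentence block into (text, intent, tokens, slot_tags)."""
    text, intent, tokens, tags = None, None, [], []
    for line in block.strip().splitlines():
        if line.startswith("# text:"):
            text = line[len("# text:"):].strip()
        elif line.startswith("# intent:"):
            intent = line[len("# intent:"):].strip()
        elif line and not line.startswith("#"):
            token, tag = line.split("\t")  # one token and its BIO tag per line
            tokens.append(token)
            tags.append(tag)
    return text, intent, tokens, tags

# Illustrative example; the intent and slot labels are made up for the sketch.
example = """# text: set an alarm for 7 am
# intent: alarm/set_alarm
set\tO
an\tO
alarm\tO
for\tO
7\tB-datetime
am\tI-datetime
"""

text, intent, tokens, tags = parse_example(example)
```

A parser like this yields exactly the two supervision signals the multitask model needs: one intent label per sentence and one slot tag per token.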
Problem

Research questions and challenges this paper addresses.

Intent Detection
Slot Filling
Dialect Identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task Learning Model
Cross-lingual Generalization
Automatic Data Annotation
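The multi-task learning contribution means intent detection and slot filling share one encoder and are trained jointly. A common way to do this is to sum a sentence-level cross-entropy for the intent with an averaged token-level cross-entropy for the slot tags; the sketch below shows that combined objective in pure Python. The `alpha` weighting and all function names are illustrative assumptions, since the page does not specify how the paper balances the two losses.

```python
# Hedged sketch of a joint (multi-task) objective for intent detection +
# slot filling: one cross-entropy term for the sentence-level intent plus
# an averaged token-level cross-entropy for the slot tags. The equal-by-
# default task weighting (alpha=0.5) is an assumption for illustration.
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, gold_index):
    """Negative log-probability of the gold class."""
    return -math.log(softmax(logits)[gold_index])

def joint_loss(intent_logits, intent_gold, slot_logits, slot_gold, alpha=0.5):
    """Weighted sum of the intent loss and the mean per-token slot loss."""
    intent_loss = cross_entropy(intent_logits, intent_gold)
    slot_loss = sum(
        cross_entropy(logits, gold)
        for logits, gold in zip(slot_logits, slot_gold)
    ) / len(slot_gold)
    return alpha * intent_loss + (1 - alpha) * slot_loss
```

In the paper's setting the logits would come from classification heads on top of a shared multilingual encoder; the sketch only shows how the two supervision signals combine into a single training objective.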
Jaione Bengoetxea
HiTZ Basque Center for Language Technology - Ixa, University of the Basque Country UPV/EHU
Mikel Zubillaga
HiTZ Center – Ixa, University of the Basque Country (UPV/EHU)
Ekhi Azurmendi
HiTZ Center – Ixa, University of the Basque Country (UPV/EHU)
Maite Heredia
PhD student, IXA, EHU
Julen Etxaniz
PhD Student in NLP, HiTZ, University of the Basque Country
Markel Ferro
HiTZ Center – Ixa, University of the Basque Country (UPV/EHU)
Jeremy Barnes
HiTZ Center – Ixa, University of the Basque Country (UPV/EHU)