BERnaT: Basque Encoders for Representing Natural Textual Diversity

📅 2025-12-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Language models often show representational bias and reduced robustness because they rely heavily on quality-filtered, standardized corpora that leave out dialectal, historical, and informal varieties. To address this, the paper argues for *full-spectrum language modeling* and uses Basque as a case study. The authors build new Basque pretraining corpora from standard text, social media content, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. They also propose an evaluation framework that splits NLU tasks into standard and diverse subsets, so that a model's ability to handle linguistic variation can be measured separately from its performance on standardized text. Models trained on both standard and diverse data consistently outperform those trained on standard corpora alone, with no loss of accuracy on standard benchmarks. The results support explicitly including linguistic diversity in language model pretraining.

📝 Abstract

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
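The three pretraining configurations named in the abstract can be pictured with a minimal sketch. The abstract does not say how the sources are assigned to each configuration, so the grouping below (social media plus historical text forming the "diverse" set, and "combined" being the union with standard text) is an assumption, and `build_configurations` is a hypothetical helper, not the authors' code.

```python
from typing import Dict, List


def build_configurations(
    standard: List[str],
    social_media: List[str],
    historical: List[str],
) -> Dict[str, List[str]]:
    """Group documents into the three corpus configurations.

    Assumption: "diverse" = social media + historical text, and
    "combined" = standard + diverse. The source only names the three
    configurations; it does not spell out this mapping.
    """
    diverse = list(social_media) + list(historical)
    return {
        "standard": list(standard),
        "diverse": diverse,
        "combined": list(standard) + diverse,
    }


# Toy placeholder documents, not real corpus content.
configs = build_configurations(
    standard=["std_doc_1", "std_doc_2"],
    social_media=["social_doc_1"],
    historical=["hist_doc_1"],
)
for name, docs in configs.items():
    print(name, len(docs))
```

Each configuration would then feed a separate pretraining run, which is what makes the standard-only and combined models directly comparable.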

Problem

Research questions and friction points this paper is trying to address.

Addresses exclusion of non-standard linguistic varieties in language models.

Focuses on capturing full language variation for model robustness.

Evaluates models on diverse subsets to assess linguistic generalization.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines standard, social media, and historical corpora for Basque.

Pre-trains encoder-only models in standard, diverse, and combined configurations.

Evaluates NLU tasks within a framework that separates standard and diverse subsets.
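The idea of reporting NLU results separately for standard and diverse subsets can be sketched as follows. This is an illustration under stated assumptions: the task names, the scores, and the use of a plain mean per subset are all hypothetical, and the paper's actual tasks, metrics, and aggregation are not given in this summary.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

VALID_SUBSETS = ("standard", "diverse")


def average_by_subset(
    results: Iterable[Tuple[str, str, float]],
) -> Dict[str, float]:
    """Average task scores per subset.

    Each result is (task_name, subset, score). Aggregating with an
    unweighted mean is an assumption made for this sketch.
    """
    scores = defaultdict(list)
    for task, subset, score in results:
        if subset not in VALID_SUBSETS:
            raise ValueError(f"unknown subset {subset!r} for task {task!r}")
        scores[subset].append(score)
    return {subset: mean(values) for subset, values in scores.items()}


# Placeholder scores for illustration only; these are not results from the paper.
example_results = [
    ("task_a", "standard", 0.5),
    ("task_b", "standard", 1.0),
    ("task_c", "diverse", 0.25),
]
summary = average_by_subset(example_results)
print(summary)
```

Keeping the two averages apart is the point of the split: a single pooled score could hide a model that does well on standardized text but poorly on dialectal or informal input.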

🔎 Similar Papers

Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores