UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

📅 2025-06-02

🤖 AI Summary

The absence of large-scale, standardized, multilingual CEFR-aligned datasets hinders reproducible research and collaboration in language proficiency assessment. To address this, we introduce UniversalCEFR, an open multilingual CEFR-annotated dataset covering 13 languages and 505,807 texts, formatted uniformly to support cross-lingual and cross-task evaluation. We benchmark three complementary modeling paradigms: (i) classification on linguistic features, (ii) fine-tuning of pretrained language models, and (iii) descriptor-based prompting of instruction-tuned large language models. The results support the use of linguistic features and fine-tuned pretrained models for multilingual CEFR level assessment. The work also aims to standardize dataset formats and data distribution practices so that proficiency evaluation is easier to compare and reproduce.

📝 Abstract

We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
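As a rough illustration of the linguistic feature-based paradigm named in the abstract, the sketch below computes three simple surface features that could feed a CEFR classifier. The function name `surface_features` and the choice of features are assumptions made for illustration, not the feature set used in the paper, and the tokenization assumes whitespace-delimited scripts.

```python
import re


def surface_features(text: str) -> dict[str, float]:
    """Compute a few surface-level features of a text.

    Hypothetical example features; the paper's actual feature set is not
    specified in this summary.
    """
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ matches Unicode word characters, so accented letters are kept.
    words = re.findall(r"\w+", text.lower())

    if not sentences or not words:
        return {
            "avg_sentence_len": 0.0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
        }

    return {
        # Mean number of words per sentence.
        "avg_sentence_len": len(words) / len(sentences),
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Share of distinct word forms among all words.
        "type_token_ratio": len(set(words)) / len(words),
    }
```

Feature vectors of this kind would typically be passed to a standard classifier trained on the CEFR labels; which classifier and features work best is an empirical question that this sketch does not address.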

Problem

Research questions and friction points this paper is trying to address.

Creating a multilingual dataset for language proficiency assessment

Standardizing data formats for consistent cross-language analysis

Evaluating modeling approaches for CEFR level prediction
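To make the idea of a standardized data format concrete, here is a minimal sketch of normalizing heterogeneous source rows into one record type. The field names (`text`, `lang`, `cefr_level`, `source`) and the helper `normalize_record` are hypothetical and are not taken from the UniversalCEFR release; only the six CEFR levels A1 to C2 are standard.

```python
from dataclasses import dataclass

# The six standard CEFR levels, from lowest to highest.
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


@dataclass(frozen=True)
class CefrRecord:
    """Hypothetical unified record; field names are illustrative only."""
    text: str
    lang: str
    cefr_level: str
    source: str


def normalize_record(raw: dict) -> CefrRecord:
    """Map one raw source row onto the unified record, validating the label."""
    level = str(raw["cefr_level"]).strip().upper()
    if level not in CEFR_LEVELS:
        raise ValueError(f"unknown CEFR level: {level!r}")

    text = str(raw["text"]).strip()
    if not text:
        raise ValueError("empty text")

    return CefrRecord(
        text=text,
        lang=str(raw["lang"]).strip().lower(),
        cefr_level=level,
        # Fall back to a placeholder when the source corpus is not recorded.
        source=str(raw.get("source", "unknown")),
    )
```

Validating labels at load time means that downstream code for any task or language can rely on the same fields and the same label set.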

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multilingual CEFR-labeled dataset

Standardized unified data format

Benchmarks with linguistic features, fine-tuned LLMs, and descriptor-based prompting
