UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

📅 2025-06-02
🤖 AI Summary
The absence of large-scale, standardized, multilingual CEFR-aligned datasets hinders reproducible research and global collaboration in language proficiency assessment. To address this, we introduce UniversalCEFR, an open multilingual CEFR-annotated dataset covering 13 languages and 505,807 texts, formatted uniformly to support cross-lingual and cross-task evaluation. We benchmark three complementary modeling paradigms: (i) linguistic feature-based classification, (ii) fine-tuning of pre-trained language models, and (iii) descriptor-based prompting of instruction-tuned large language models. The results support the use of linguistic features and fine-tuned pretrained models for multilingual CEFR level assessment. The work aims to standardize dataset formats and data distribution practices in language proficiency research, making proficiency evaluation easier to reproduce and compare across languages.

📝 Abstract
We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
Problem

Research questions and friction points this paper is trying to address.

Creating a multilingual dataset for language proficiency assessment
Standardizing data formats for consistent cross-language analysis
Evaluating modeling approaches for CEFR level prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multilingual CEFR-labeled dataset
Standardized unified data format
Benchmarks of linguistic feature-based classifiers, fine-tuned pre-trained LLMs, and descriptor-based prompting
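
As a rough illustration of the first two contributions, the sketch below pairs a unified record with a few surface features of the kind a feature-based classifier might consume. It is not the UniversalCEFR schema or the authors' feature set: the `CefrRecord` class, its field names, the `surface_features` function, and the toy sentence with its label are all assumptions made for this example. Only the six CEFR level names (A1 to C2) come from the framework itself.

```python
import re
from dataclasses import dataclass

# The six CEFR proficiency levels, ordered from lowest to highest.
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


@dataclass(frozen=True)
class CefrRecord:
    """Hypothetical unified record; field names are illustrative only."""
    text: str
    lang: str        # language code, e.g. "en"
    cefr_level: str  # must be one of CEFR_LEVELS

    def __post_init__(self):
        # Reject labels outside the CEFR scale so all sources share one label set.
        if self.cefr_level not in CEFR_LEVELS:
            raise ValueError(f"unknown CEFR level: {self.cefr_level!r}")


def surface_features(text: str) -> dict:
    """Compute three simple surface features of a text (an assumed, toy feature set)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not words or not sentences:
        return {"mean_sentence_len": 0.0, "mean_word_len": 0.0, "type_token_ratio": 0.0}
    return {
        "mean_sentence_len": len(words) / len(sentences),          # words per sentence
        "mean_word_len": sum(len(w) for w in words) / len(words),  # characters per word
        "type_token_ratio": len(set(words)) / len(words),          # lexical diversity
    }


# Toy record invented for this example; it is not taken from the dataset.
record = CefrRecord(text="I like cats. I like dogs.", lang="en", cefr_level="A1")
features = surface_features(record.text)
```

Feature dictionaries like `features` could then be vectorized and passed to any standard classifier. Validating the label at construction time is one simple way to keep records from different sources comparable.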

Joseph Marvin Imperial
University of Bath, National University Philippines

Abdullah Barayan
Cardiff University

Regina Stodden
Postdoctoral Researcher @ Uni Bielefeld
Natural Language Processing, Text Simplification, Human Evaluation

Rodrigo Wilkens
University of Exeter
Natural Language Processing, Language Acquisition, Readability, Text Simplification

Ricardo Munoz Sanchez
University of Gothenburg

Lingyun Gao
UCLouvain

Melissa Torgbi
University of Bath
Large Language Models

Dawn Knight
Professor in Applied Linguistics, Cardiff University, UK
Corpus Linguistics, Pragmatics, Discourse Analysis, Multimodality, E-Language

Gail Forey
University of Bath

Reka R. Jablonkai
University of Bath

Ekaterina Kochmar
Assistant Professor, Natural Language Processing Department, MBZUAI
Natural Language Processing, Machine Learning, Artificial Intelligence in Education

Robert Reynolds
Brigham Young University

Eugenio Ribeiro
INESC-ID Lisboa, Instituto Universitário de Lisboa (ISCTE-IUL)

Horacio Saggion
Chair in Computer Science & Artificial Intelligence, Universitat Pompeu Fabra, DTIC. Head of TALN.
Natural Language Processing, Artificial Intelligence, Computer Science

Elena Volodina
Professor, University of Gothenburg
NLP, language technology, Intelligent Computer-Assisted Language Learning, Corpus Linguistics

Sowmya Vajjala
National Research Council, Canada
Natural Language Processing

Thomas François
Associate Professor at Université catholique de Louvain
Applied Linguistics, NLP, Readability, Text Simplification, Automated Essay Scoring

Fernando Alva-Manchego
Cardiff University
Text Simplification, Readability Assessment, Text Adaptation, Educational NLP, Natural Language Processing

H. T. Madabushi
University of Bath