🤖 AI Summary
This study investigates whether multilingual embedding models encode generalizable cross-lingual representations of language proficiency. Using hidden-state activations from Qwen3-Embedding models (0.6B/4B/8B), the authors train five linear and nonlinear probes to predict CEFR proficiency levels of learner texts across seven languages and nine corpora, evaluating both in-distribution and cross-corpus generalization. Results show strong in-distribution performance (QWK ≈ 0.7), substantially outperforming surface-feature baselines; however, probe accuracy drops markedly in cross-corpus settings, indicating that current embeddings primarily capture corpus-specific signals rather than universal dimensions of language ability. This work presents the first systematic assessment of the cross-corpus transferability, and its limitations, of language proficiency representations in multilingual embeddings.
📝 Abstract
Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and nonlinear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance (QWK ≈ 0.7), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation, performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.
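The probing setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: a linear probe is fit on fixed embedding vectors to predict ordinal CEFR labels, and predictions are scored with Quadratic Weighted Kappa (QWK), the paper's evaluation metric. Random features with an injected toy signal stand in for real Qwen3-Embedding hidden states; all sizes and names are hypothetical.

```python
# Sketch of a linear probe for CEFR prediction, scored with QWK.
# Assumption: real inputs would be hidden-state activations from
# Qwen3-Embedding; synthetic features are used here for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n, d = 600, 64                       # toy sizes: texts x embedding dim
y = rng.integers(0, 6, size=n)       # CEFR A1..C2 mapped to ordinals 0..5
X = rng.normal(size=(n, d))
X[:, 0] += y                         # inject a weak "proficiency" signal

# Train the probe on one split, evaluate on a held-out split.
probe = LogisticRegression(max_iter=1000).fit(X[:500], y[:500])
pred = probe.predict(X[500:])
qwk = cohen_kappa_score(y[500:], pred, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```

QWK weights disagreements by their squared distance on the ordinal scale, so confusing A2 with B1 is penalized far less than confusing A2 with C2; this makes it the natural metric for ordered proficiency levels. The paper's cross-corpus finding corresponds to fitting the probe on one corpus's embeddings and evaluating on another's, where QWK drops sharply.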