🤖 AI Summary
This study investigates whether multilingual embedding models encode generalizable cross-lingual representations of language proficiency. Using hidden-state activations from Qwen3-Embedding models (0.6B/4B/8B), the authors train five linear and nonlinear probes to predict CEFR proficiency levels of learner texts across seven languages and nine corpora, evaluating both in-distribution and cross-corpus generalization. Results show strong in-distribution performance (QWK ≈ 0.7), substantially outperforming surface-feature baselines; however, probe accuracy drops markedly in cross-corpus settings, indicating that current embeddings primarily capture corpus-specific signals rather than universal dimensions of language ability. This work presents the first systematic assessment of the cross-corpus transferability, and its limitations, of language proficiency representations in multilingual embeddings.
📝 Abstract
Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and nonlinear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance (QWK ≈ 0.7), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation, performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.
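The probing setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: a linear probe is fit on fixed embedding vectors to predict ordinal CEFR labels, and predictions are scored with Quadratic Weighted Kappa (QWK), the paper's evaluation metric. Random features with an injected toy signal stand in for real Qwen3-Embedding hidden states; all sizes and names are hypothetical.

```python
# Sketch of a linear probe for CEFR prediction, scored with QWK.
# Assumption: real inputs would be hidden-state activations from
# Qwen3-Embedding; synthetic features are used here for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n, d = 600, 64                       # toy sizes: texts x embedding dim
y = rng.integers(0, 6, size=n)       # CEFR A1..C2 mapped to ordinals 0..5
X = rng.normal(size=(n, d))
X[:, 0] += y                         # inject a weak "proficiency" signal

# Train the probe on one split, evaluate on a held-out split.
probe = LogisticRegression(max_iter=1000).fit(X[:500], y[:500])
pred = probe.predict(X[500:])
qwk = cohen_kappa_score(y[500:], pred, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```

QWK weights disagreements by their squared distance on the ordinal scale, so confusing A2 with B1 is penalized far less than confusing A2 with C2; this makes it the natural metric for ordered proficiency levels. The paper's cross-corpus finding corresponds to fitting the probe on one corpus's embeddings and evaluating on another's, where QWK drops sharply.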