🤖 AI Summary
Existing benchmarks inadequately assess language models’ capacity to model second language acquisition (SLA) principles.
Method: We introduce BLiSS 1.0, a benchmark built from over 2.8 million authentic sentences produced by second language learners. It pioneers a "selective tolerance" evaluation paradigm, organizing inputs into controlled triplets (a corrected sentence, a naturally occurring learner error, and a matched artificial error) to isolate a model's ability to distinguish acquisition-consistent errors from arbitrary violations. Evaluation combines acceptability scoring with clustering analysis.
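As a rough illustration of how such acceptability scoring could work, here is a minimal sketch that scores each sentence in a triplet by its summed token log-probability under a causal LM. The model choice, helper function, and example sentences are assumptions for illustration, not the paper's released code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; BLiSS evaluates a diverse suite of models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of `sentence` under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood per predicted token; scale back to a total.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

# Invented triplet, for illustration only.
triplet = {
    "corrected":  "She has lived here for three years.",
    "learner":    "She has lived here since three years.",  # naturalistic L2 error
    "artificial": "She has lived three here for years.",    # matched artificial error
}
scores = {k: sentence_logprob(v) for k, v in triplet.items()}

# Selective tolerance: the naturalistic learner error should be judged
# more plausible than the matched artificial violation.
print(scores)
print("selective tolerance:", scores["learner"] > scores["artificial"])
```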
Contribution/Results: We find that selective tolerance is empirically distinct from conventional grammaticality judgment, and that model performance clusters clearly by training paradigm (e.g., pretraining vs. instruction tuning). BLiSS 1.0 is the first benchmark to enable fine-grained, goal-sensitive assessment of how well models simulate SLA, establishing a cognitively grounded standard for evaluating language models.
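To make the clustering claim concrete, the sketch below clusters models by their vectors of per-subtask accuracies. The model names, scores, and subtask count are invented for illustration; the paper's actual analysis may use different features and methods:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each row is one model's (invented) accuracy profile across three subtasks.
models = ["lm-a (pretrain)", "lm-b (pretrain)", "lm-c (instruct)", "lm-d (instruct)"]
profiles = np.array([
    [0.81, 0.64, 0.72],
    [0.79, 0.61, 0.70],
    [0.68, 0.83, 0.58],
    [0.66, 0.85, 0.60],
])

# Agglomerative clustering on Euclidean distances between profiles.
Z = linkage(profiles, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")

for name, label in zip(models, labels):
    print(f"{name}: cluster {label}")
# On these toy profiles, the two pretrained-only models fall in one
# cluster and the two instruction-tuned models in the other.
```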
📝 Abstract
To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model's alignment with the systematic patterns of human language acquisition.