🤖 AI Summary
Existing benchmarks inadequately assess language models’ capacity to model second language acquisition (SLA) principles.
Method: We introduce BLiSS 1.0, a benchmark built from over 2.8 million authentic sentences produced by second language learners. It pioneers a "selective tolerance" evaluation paradigm, organizing inputs into controlled triplets (a corrected sentence, a naturally occurring learner error, and a matched artificial error) to isolate a model's ability to distinguish acquisition-consistent errors from arbitrary violations. Evaluation combines acceptability scoring with clustering analysis.
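As a rough illustration of how such acceptability scoring could work, here is a minimal sketch that scores each sentence in a triplet by its summed token log-probability under a causal LM. The model choice, helper function, and example sentences are assumptions for illustration, not the paper's released code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; BLiSS evaluates a diverse suite of models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of `sentence` under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood per predicted token; scale back to a total.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

# Invented triplet, for illustration only.
triplet = {
    "corrected":  "She has lived here for three years.",
    "learner":    "She has lived here since three years.",  # naturalistic L2 error
    "artificial": "She has lived three here for years.",    # matched artificial error
}
scores = {k: sentence_logprob(v) for k, v in triplet.items()}

# Selective tolerance: the naturalistic learner error should be judged
# more plausible than the matched artificial violation.
print(scores)
print("selective tolerance:", scores["learner"] > scores["artificial"])
```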
Contribution/Results: We find that selective tolerance is empirically distinct from conventional grammaticality judgment, and that model performance clusters clearly by training paradigm (e.g., pretraining vs. instruction tuning). BLiSS 1.0 is the first benchmark to enable fine-grained, goal-sensitive assessment of how well models simulate SLA, establishing a cognitively grounded standard for evaluating language models.
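To make the clustering claim concrete, the sketch below clusters models by their vectors of per-subtask accuracies. The model names, scores, and subtask count are invented for illustration; the paper's actual analysis may use different features and methods:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each row is one model's (invented) accuracy profile across three subtasks.
models = ["lm-a (pretrain)", "lm-b (pretrain)", "lm-c (instruct)", "lm-d (instruct)"]
profiles = np.array([
    [0.81, 0.64, 0.72],
    [0.79, 0.61, 0.70],
    [0.68, 0.83, 0.58],
    [0.66, 0.85, 0.60],
])

# Agglomerative clustering on Euclidean distances between profiles.
Z = linkage(profiles, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")

for name, label in zip(models, labels):
    print(f"{name}: cluster {label}")
# On these toy profiles, the two pretrained-only models fall in one
# cluster and the two instruction-tuned models in the other.
```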
📝 Abstract
To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model's alignment with the systematic patterns of human language acquisition.