🤖 AI Summary
This study challenges the prevailing hypothesis that improved text readability is what enables small language models (SLMs) to generate coherent text. Method: We construct a synthetically generated dataset with controlled readability and consistent structural properties, and run systematic controlled-variable ablation experiments to probe the relationship between linguistic complexity and model learnability. Contribution/Results: (1) Statistical simplicity, measured by n-gram diversity, predicts SLM learning efficiency better than conventional readability metrics; (2) SLMs trained on complex, adult-targeted corpora develop coherence at rates, and reach final performance, comparable to (or even surpassing) models trained on simplified corpora. This work provides the first empirical challenge to the paradigm that directly analogizes child language acquisition mechanisms to SLM training, demonstrating that lexical and syntactic simplification is not a necessary condition for efficient learning in small models.
📝 Abstract
Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability -- characterized by accessible vocabulary, familiar narrative structure, and simple syntax -- plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training -- drawing parallels to human cognitive development without empirical basis -- and argue for more precise reasoning about what properties actually support capability emergence in small models.
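The abstract's key predictor is n-gram diversity as a proxy for statistical simplicity: a corpus that reuses the same word sequences is statistically simpler, regardless of how readable it looks. The paper does not specify its exact formulation here, so the sketch below is an assumption, using one common variant: the ratio of distinct n-grams to total n-grams (lower values indicate a more repetitive, statistically simpler corpus).

```python
def ngram_diversity(tokens, n=2):
    """Ratio of distinct n-grams to total n-grams in a token sequence.

    NOTE: an illustrative assumption, not the paper's exact metric.
    Lower values mean more repetition (greater statistical simplicity);
    higher values mean more diverse word sequences.
    """
    # Slide a window of size n over the token list to collect n-grams.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# A repetitive, child-directed-style sentence vs. a denser adult-level one.
simple_text = "the cat sat on the mat and the cat sat on the rug".split()
dense_text = "epistemic humility tempers inferences drawn from sparse data".split()

# The repetitive text reuses bigrams, so its diversity score is lower.
print(ngram_diversity(simple_text))  # repeated bigrams lower the ratio
print(ngram_diversity(dense_text))   # all bigrams distinct -> ratio of 1.0
```

Under this formulation, the paper's claim amounts to: corpora with lower n-gram diversity are easier for SLMs to learn from, whether or not their vocabulary and syntax are simple.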