Rethinking the Role of Text Complexity in Language Model Pretraining

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates how textual complexity affects pretraining of language models, addressing three key questions: (1) differential sensitivity of models of varying scales to text complexity; (2) whether simplified text can support effective representation learning; and (3) how complexity differentially impacts downstream performance in fine-tuning versus zero-shot settings. Methodologically, we employ large language models to automatically simplify real-world texts while preserving semantics, then conduct de novo pretraining of causal language models (28M–500M parameters) on both original and simplified corpora, evaluating them across diverse language understanding tasks. Key findings show that simplification substantially reduces perplexity for smaller models and improves zero-shot performance on linguistic knowledge tasks, whereas complex texts better support world knowledge acquisition and entity tracking. Fine-tuning performance remains largely insensitive to complexity. This work is the first to reveal a strong interaction between model capacity and textual complexity, challenging the prevailing assumption that higher textual complexity inherently implies higher-quality pretraining data.
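The simplification step described above can be illustrated with a short sketch: prompt an instruction-tuned LLM to rewrite a passage with shorter sentences and simpler words while keeping the content fixed. This is a minimal sketch assuming an OpenAI-style chat API; the model name (`gpt-4o-mini`), the prompt wording, and the `simplify` helper are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of the simplification step: ask an instruction-tuned LLM
# to reduce surface-level complexity while preserving the passage's content.
# Model name and prompt are illustrative, not the paper's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SIMPLIFY_PROMPT = (
    "Rewrite the following text using shorter sentences, simpler words, and "
    "simpler sentence structure. Keep all facts, entities, and meaning unchanged.\n\n"
    "Text:\n{passage}"
)

def simplify(passage: str, model: str = "gpt-4o-mini") -> str:
    """Return a surface-simplified version of `passage` with content preserved."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SIMPLIFY_PROMPT.format(passage=passage)}],
        temperature=0.0,  # deterministic rewrites keep the simplified corpus reproducible
    )
    return response.choices[0].message.content
```

Applying such a function over the pretraining corpus yields the paired original/simplified datasets on which the models are trained from scratch.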

📝 Abstract
Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity is less explored. Text complexity refers to how hard a text is to read, and is typically estimated from surface cues such as sentence length, word choice, and sentence structure. We reduce surface-level complexity--shorter sentences, simpler words, simpler structure--while keeping core text content close to constant, and ask: (1) How does complexity affect language modeling across model sizes? (2) Can useful representations be learned from simpler text alone? (3) How does pretraining text complexity influence downstream language understanding? To answer these questions, we simplify human-written texts using a large language model, then pretrain causal models (28M-500M) from scratch on both original and simplified data, and evaluate them in finetuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity--smaller models degrade far less on simpler texts--while text complexity has little impact on finetuning evaluations, with zero-shot evaluations indicating that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking.
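As a rough illustration of the surface cues the abstract mentions (sentence length, word choice, sentence structure), the sketch below computes a few crude readability proxies in plain Python. The function name `surface_complexity`, the cue set, and the example sentences are assumptions for illustration, not the paper's actual complexity measure.

```python
# Crude surface-level complexity cues (not the paper's metric): longer sentences
# and a higher share of long words are treated as proxies for harder text.
import re

def surface_complexity(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_word_len": 0.0, "long_word_ratio": 0.0}
    return {
        "avg_sentence_len": len(words) / len(sentences),          # words per sentence
        "avg_word_len": sum(len(w) for w in words) / len(words),  # characters per word
        "long_word_ratio": sum(len(w) >= 7 for w in words) / len(words),
    }

original = "The committee's deliberations, notwithstanding procedural objections, concluded expeditiously."
simplified = "The committee talked it over. Despite some objections, they finished quickly."
print(surface_complexity(original))    # higher values: longer sentence, longer words
print(surface_complexity(simplified))  # lower values after surface simplification
```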
Problem

Research questions and friction points this paper is trying to address.

Investigating how text complexity affects language model pretraining across different model sizes
Determining if useful representations can be learned from simplified text alone
Examining how pretraining text complexity influences downstream language understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simplified human-written texts with a large language model while keeping core content close to constant
Pretrained causal language models (28M–500M parameters) from scratch on both original and simplified corpora (a minimal pretraining sketch follows this list)
Evaluated model performance across text complexity levels in fine-tuning and zero-shot setups
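A minimal end-to-end sketch of the pretraining step, assuming a Hugging Face GPT-2-style setup: the layer and width choices, the placeholder corpus path `simplified_corpus.txt`, and the single-epoch training arguments are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: pretrain a small causal LM from scratch on one corpus (original or
# simplified) and report perplexity. Configuration values are illustrative.
import math
from datasets import load_dataset
from transformers import (GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Roughly 28M-parameter configuration; depth/width here are assumed, not the paper's.
config = GPT2Config(vocab_size=tokenizer.vocab_size, n_positions=512,
                    n_embd=256, n_layer=8, n_head=8)
model = GPT2LMHeadModel(config)  # randomly initialized, i.e. trained from scratch

# "simplified_corpus.txt" is a placeholder path; swap in the original corpus to compare.
dataset = load_dataset("text", data_files={"train": "simplified_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", per_device_train_batch_size=16,
                           num_train_epochs=1, logging_steps=100),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# In practice a held-out split would be used; evaluating on the training set
# here only keeps the sketch short.
eval_loss = trainer.evaluate(tokenized)["eval_loss"]
print("perplexity:", math.exp(eval_loss))
```

Running the same script on the original and the simplified corpus, at several model sizes, gives the perplexity comparison the paper reports across capacity and complexity.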
Dan John Velasco
Samsung Research Philippines
Natural Language Processing · Deep Learning
Matthew Theodore Roque
Samsung R&D Institute Philippines