Language Acquisition Device in Large Language Models

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the data inefficiency of large language models relative to human learners in acquiring linguistic structure. Inspired by the Language Acquisition Device (LAD) hypothesis, the authors propose pre-pretraining on MP-STRUCT, a synthetic formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via the operations MERGE, AGREE, and MOVE. A 500-step pre-pretraining phase matches strong formal-language baselines in token efficiency and additionally confers a human-like resistance to structurally implausible languages such as REVERSE. A simplified variant, MP-STRUCT CORE, outperforms $k$-Shuffle Dyck even though it is not definable in C-RASP (a formal bound on Transformer expressivity). The authors identify functional landmarks, which reduce ambiguity in dependency resolution, as a key driver, suggesting that effective pre-pretraining design depends on the accessibility of dependency resolution as well as on expressivity.

📝 Abstract

Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to close this gap, with prior work emphasizing highly expressive formal languages such as $k$-Shuffle Dyck. Inspired by the Language Acquisition Device (LAD) hypothesis, which posits that innate constraints preemptively restrict the learner's hypothesis space to natural-language-like structure, we propose LAD-inspired PPT: pre-pretraining on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE. A brief 500-step PPT with MP-STRUCT matches strong formal-language baselines in token efficiency while additionally imparting a human-like resistance to structurally implausible languages (e.g., REVERSE). Analyzing simplified variants, we find that MP-STRUCT CORE outperforms $k$-Shuffle Dyck despite not being definable in C-RASP (a formal bound on transformer expressivity), challenging the prior hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable. We show that functional landmarks, which reduce dependency resolution ambiguity, are a key driver, suggesting that effective PPT design depends not only on expressivity but also on the accessibility of dependency resolution.
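For readers unfamiliar with the $k$-Shuffle Dyck baseline named in the abstract, the following is a minimal sketch of that language under its standard definition: $k$ bracket types are interleaved freely, and each type only has to be balanced on its own. The function names, the `(kind, is_open)` token encoding, and the sampling scheme are assumptions made for illustration; they are not taken from the paper, and the sketch does not model MP-STRUCT itself.

```python
import random


def is_k_shuffle_dyck(tokens, k):
    """Return True if `tokens` is in k-Shuffle Dyck.

    Each token is a pair (kind, is_open) with 0 <= kind < k (hypothetical encoding).
    A string is accepted when every bracket kind, viewed in isolation, is balanced.
    """
    depth = [0] * k  # one independent counter per bracket kind
    for kind, is_open in tokens:
        if not 0 <= kind < k:
            return False
        if is_open:
            depth[kind] += 1
        else:
            depth[kind] -= 1
            if depth[kind] < 0:  # closed a bracket that was never opened
                return False
    return all(d == 0 for d in depth)


def sample_k_shuffle_dyck(k, length, rng):
    """Sample a k-Shuffle Dyck string with exactly `length` tokens (length must be even).

    Illustrative sampler only: opens and closes are chosen with equal probability
    whenever both moves are legal.
    """
    if length % 2 != 0:
        raise ValueError("length must be even")
    depth = [0] * k
    out = []
    for pos in range(length):
        remaining = length - pos
        # Opening is allowed only if every open bracket can still be closed afterwards.
        can_open = sum(depth) + 2 <= remaining
        open_kinds = [i for i in range(k) if depth[i] > 0]
        if can_open and (not open_kinds or rng.random() < 0.5):
            kind = rng.randrange(k)
            depth[kind] += 1
            out.append((kind, True))
        else:
            kind = rng.choice(open_kinds)
            depth[kind] -= 1
            out.append((kind, False))
    return out


# "( [ ) ]" has crossing dependencies: rejected by ordinary Dyck-2, accepted here.
crossing = [(0, True), (1, True), (0, False), (1, False)]
print(is_k_shuffle_dyck(crossing, k=2))
```

The per-kind counters are what distinguish this language from ordinary nested Dyck: crossing dependencies are allowed, so a checker needs no stack, only $k$ counters.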

Problem

Research questions and friction points this paper is trying to address.

Language Acquisition Device

pre-pretraining

data efficiency

synthetic languages

dependency resolution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Acquisition Device

pre-pretraining

MP-STRUCT