Training Language Models via Neural Cellular Automata

📅 2026-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key limitations in natural language pre-training—such as data scarcity, human biases, and the entanglement of knowledge with reasoning—by proposing a novel approach that leverages neural cellular automata (NCA) to generate synthetic, non-linguistic data exhibiting language-like statistical properties. The method introduces NCA-generated sequences for pre-pre-training large language models before fine-tuning on real textual data. This is the first application of NCAs to language model pre-training, offering scalable, controllable synthetic data whose complexity can be tailored to specific target domains. Using only 164 million NCA tokens for pre-pre-training, the approach achieves up to a 6% improvement in downstream language modeling performance, accelerates convergence by 1.6×, and significantly outperforms baselines on reasoning benchmarks including GSM8K, HumanEval, and BigBench-Lite.

📝 Abstract
Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs: training on synthetic, then natural, language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
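The paper's learned neural CA architecture is not specified on this page, but the core idea of flattening a cellular automaton's spatiotemporal rollout into a token stream can be sketched with a classic 1D elementary cellular automaton as a stand-in. In this hypothetical sketch, `rule`, `width`, and `steps` are illustrative knobs for controlling the complexity of the synthetic distribution; a neural CA would replace the fixed rule table with a learned update function.

```python
import random

def ca_token_sequence(rule: int, width: int = 32, steps: int = 8, seed: int = 0):
    """Roll out a 1D elementary cellular automaton and flatten the
    spatiotemporal grid into a token sequence (one token per cell state).

    NOTE: a stand-in for the paper's (learned) neural CA, which this page
    does not specify; the parameters here are illustrative only.
    """
    rng = random.Random(seed)
    row = [rng.randint(0, 1) for _ in range(width)]  # random initial state
    tokens = list(row)
    table = [(rule >> i) & 1 for i in range(8)]  # Wolfram rule lookup table
    for _ in range(steps):
        # Each cell's next state depends on its (left, center, right)
        # neighborhood, read as a 3-bit index into the rule table.
        row = [
            table[(row[(i - 1) % width] << 2) | (row[i] << 1) | row[(i + 1) % width]]
            for i in range(width)
        ]
        tokens.extend(row)  # append the new row to the flat token stream
    return tokens

seq = ca_token_sequence(rule=110, width=16, steps=4)
print(len(seq))  # (steps + 1) * width = 80
```

Simple rules (e.g. rule 0 or 255) yield trivial, repetitive streams, while rules like 110 produce rich structure; this rule choice is the kind of complexity dial the abstract describes tuning per target domain.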
Problem

Research questions and friction points this paper is trying to address.

pre-training
natural language
bias
knowledge-reasoning entanglement
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Cellular Automata
synthetic pre-training
pre-pre-training
transfer learning
language model efficiency