Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates how pretraining on procedural synthetic data induces transferable, modular algorithmic reasoning capabilities in small Transformers. To this end, the authors design multiple semantics-free generative rules and conduct partial-transfer experiments alongside systematic ablation studies. The analysis reveals that distinct rules shape inductive biases in different model components: attention layers primarily govern cross-task transferability, whereas certain rules substantially enhance the algorithmic generalization capacity of MLP blocks, and joint pretraining on multiple rules synergistically improves multi-task reasoning performance. Crucially, this work provides the first evidence that procedural pretraining enables an architectural-level separation of knowledge acquisition (localized in attention mechanisms) from reasoning execution (localized in MLP modules). It further demonstrates that synthetic data can encode modular algorithmic structures in model weights in a composable manner, substantiating the feasibility of weight-level modular representation learning through carefully engineered procedural supervision.
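The partial-transfer experiments can be pictured with a minimal sketch (not the authors' code): the weights of a single component, attention or MLP, are copied from a donor model pretrained on procedural data into a freshly initialised model before fine-tuning on the downstream task. The model definition, layer sizes, and the `transfer` helper below are illustrative assumptions.

```python
import torch.nn as nn

def make_model(d_model=64, n_heads=4, n_layers=2, vocab=128):
    """A small Transformer encoder with a token embedding and an output head."""
    layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
    return nn.ModuleDict({
        "embed": nn.Embedding(vocab, d_model),
        "encoder": nn.TransformerEncoder(layer, n_layers),
        "head": nn.Linear(d_model, vocab),
    })

def transfer(donor, recipient, part="attention"):
    """Copy only the chosen component's weights from donor to recipient."""
    # In the encoder layers, "self_attn" parameters form the attention block,
    # while "linear1"/"linear2" form the MLP block.
    keep = "self_attn" if part == "attention" else "linear"
    partial = {k: v for k, v in donor.state_dict().items() if keep in k}
    recipient.load_state_dict(partial, strict=False)  # rest stays randomly initialised
    return recipient

donor = make_model()      # stands in for a model pretrained on procedural data
recipient = make_model()  # fresh model to be fine-tuned on the downstream task
recipient = transfer(donor, recipient, part="attention")
```

Comparing downstream performance after transferring only the attention weights versus only the MLP weights is one way to localise which component carries the structure induced by a given pretraining rule.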

📝 Abstract
Pretraining on large, semantically rich datasets is key for developing language models. Surprisingly, recent studies have shown that even synthetic data, generated procedurally through simple semantic-free algorithms, can yield some of the same benefits as natural language pretraining. It is unclear what specific capabilities such simple synthetic data instils in a model, where these capabilities reside in the architecture, and how they manifest within its weights. In this short paper, we identify several beneficial forms of procedural data, together with specific algorithmic reasoning skills that improve in small transformers. Our core finding is that different procedural rules instil distinct but complementary inductive structures in the model. With extensive ablations and partial-transfer experiments, we discover that these structures reside in different parts of the model. Attention layers often carry the most transferable information, but some pretraining rules impart useful structure to MLP blocks instead. Most interestingly, the structures induced by multiple rules can be composed to jointly reinforce multiple capabilities. These results suggest an exciting possibility of disentangling the acquisition of knowledge from reasoning in language models, with the goal of improving their robustness and data efficiency.
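To make "synthetic data generated procedurally through simple semantic-free algorithms" concrete, here is a minimal sketch of the kind of generative rule involved; the specific rules below (token copying and balanced brackets) are illustrative assumptions, not necessarily the rules used in the paper.

```python
import random

def copy_rule(length=8, vocab=64):
    """Emit a random token sequence, a separator (token 1), then an exact copy."""
    seq = [random.randrange(2, vocab) for _ in range(length)]
    return seq + [1] + seq

def bracket_rule(max_depth=6):
    """Emit a well-nested bracket sequence (token 2 = open, token 3 = close)."""
    seq = []
    def grow(depth):
        if depth == 0 or random.random() < 0.3:
            return
        seq.append(2)
        grow(depth - 1)
        seq.append(3)
        grow(depth - 1)
    grow(max_depth)
    return seq or [2, 3]

print(copy_rule())     # e.g. [17, 5, ..., 1, 17, 5, ...]
print(bracket_rule())  # e.g. [2, 2, 3, 3, 2, 3]
```

Sequences like these carry no natural-language semantics, yet predicting them requires tracking positions, copies, and nesting, the kind of structure the abstract credits with transferring to algorithmic reasoning tasks.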
Problem

Research questions and friction points this paper is trying to address.

Understanding capabilities from synthetic procedural data pretraining
Locating algorithmic reasoning structures in transformer models
Composing modular structures for multi-rule knowledge reinforcement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretraining on synthetic procedural data
Modular structures in attention layers
Composable structures for multiple capabilities
Zachary Shinnick
Australian Institute for Machine Learning (AIML), University of Adelaide, Australia
Liangze Jiang
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Hemanth Saratchandran
Australian Institute for Machine Learning (AIML), University of Adelaide; CommBank AI Scholar
Mathematics, Machine Learning
A. Hengel
Australian Institute for Machine Learning (AIML), University of Adelaide, Australia
Damien Teney
Idiap Research Institute
machine learning, computer vision, natural language processing, vision and language