Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates how pretraining on procedural synthetic data induces transferable, modular algorithmic reasoning capabilities in small Transformers. To this end, the authors design multiple semantics-free generative rules and conduct partial-transfer experiments alongside systematic ablation studies. The analysis reveals that distinct rules shape inductive biases in different model components: attention layers primarily govern cross-task transferability, whereas certain rules substantially enhance the algorithmic generalization capacity of MLP blocks, and joint pretraining on multiple rules synergistically improves multi-task reasoning performance. Crucially, this work provides the first evidence that procedural pretraining enables an architectural-level separation of knowledge acquisition (localized in attention mechanisms) from reasoning execution (localized in MLP modules). It further demonstrates that synthetic data can encode modular algorithmic structures in model weights in a composable manner, substantiating the feasibility of weight-level modular representation learning through carefully engineered procedural supervision.
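The partial-transfer experiments can be pictured with a minimal sketch (not the authors' code): the weights of a single component, attention or MLP, are copied from a donor model pretrained on procedural data into a freshly initialised model before fine-tuning on the downstream task. The model definition, layer sizes, and the `transfer` helper below are illustrative assumptions.

```python
import torch.nn as nn

def make_model(d_model=64, n_heads=4, n_layers=2, vocab=128):
    """A small Transformer encoder with a token embedding and an output head."""
    layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
    return nn.ModuleDict({
        "embed": nn.Embedding(vocab, d_model),
        "encoder": nn.TransformerEncoder(layer, n_layers),
        "head": nn.Linear(d_model, vocab),
    })

def transfer(donor, recipient, part="attention"):
    """Copy only the chosen component's weights from donor to recipient."""
    # In the encoder layers, "self_attn" parameters form the attention block,
    # while "linear1"/"linear2" form the MLP block.
    keep = "self_attn" if part == "attention" else "linear"
    partial = {k: v for k, v in donor.state_dict().items() if keep in k}
    recipient.load_state_dict(partial, strict=False)  # rest stays randomly initialised
    return recipient

donor = make_model()      # stands in for a model pretrained on procedural data
recipient = make_model()  # fresh model to be fine-tuned on the downstream task
recipient = transfer(donor, recipient, part="attention")
```

Comparing downstream performance after transferring only the attention weights versus only the MLP weights is one way to localise which component carries the structure induced by a given pretraining rule.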

📝 Abstract
Pretraining on large, semantically rich datasets is key for developing language models. Surprisingly, recent studies have shown that even synthetic data, generated procedurally through simple semantic-free algorithms, can yield some of the same benefits as natural language pretraining. It is unclear what specific capabilities such simple synthetic data instils in a model, where these capabilities reside in the architecture, and how they manifest within its weights. In this short paper, we identify several beneficial forms of procedural data, together with specific algorithmic reasoning skills that improve in small transformers. Our core finding is that different procedural rules instil distinct but complementary inductive structures in the model. With extensive ablations and partial-transfer experiments, we discover that these structures reside in different parts of the model. Attention layers often carry the most transferable information, but some pretraining rules impart useful structure to MLP blocks instead. Most interestingly, the structures induced by multiple rules can be composed to jointly reinforce multiple capabilities. These results suggest an exciting possibility of disentangling the acquisition of knowledge from reasoning in language models, with the goal of improving their robustness and data efficiency.
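To make "synthetic data generated procedurally through simple semantic-free algorithms" concrete, here is a minimal sketch of the kind of generative rule involved; the specific rules below (token copying and balanced brackets) are illustrative assumptions, not necessarily the rules used in the paper.

```python
import random

def copy_rule(length=8, vocab=64):
    """Emit a random token sequence, a separator (token 1), then an exact copy."""
    seq = [random.randrange(2, vocab) for _ in range(length)]
    return seq + [1] + seq

def bracket_rule(max_depth=6):
    """Emit a well-nested bracket sequence (token 2 = open, token 3 = close)."""
    seq = []
    def grow(depth):
        if depth == 0 or random.random() < 0.3:
            return
        seq.append(2)
        grow(depth - 1)
        seq.append(3)
        grow(depth - 1)
    grow(max_depth)
    return seq or [2, 3]

print(copy_rule())     # e.g. [17, 5, ..., 1, 17, 5, ...]
print(bracket_rule())  # e.g. [2, 2, 3, 3, 2, 3]
```

Sequences like these carry no natural-language semantics, yet predicting them requires tracking positions, copies, and nesting, the kind of structure the abstract credits with transferring to algorithmic reasoning tasks.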
Problem

Research questions and friction points this paper is trying to address.

Understanding capabilities from synthetic procedural data pretraining
Locating algorithmic reasoning structures in transformer models
Composing modular structures for multi-rule knowledge reinforcement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretraining on synthetic procedural data
Modular structures in attention layers
Composable structures for multiple capabilities
Zachary Shinnick
Australian Institute for Machine Learning (AIML), University of Adelaide, Australia
Liangze Jiang
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Hemanth Saratchandran
Australian Institute for Machine Learning (AIML), University of Adelaide; CommBank AI Scholar
Mathematics, Machine Learning
A. Hengel
Australian Institute for Machine Learning (AIML), University of Adelaide, Australia
Damien Teney
Idiap Research Institute
machine learning, computer vision, natural language processing, vision and language