🤖 AI Summary
Traditional neural networks rely on fixed computation paths, which limits their ability to achieve efficiency and flexibility simultaneously. To address this, we propose the Distributed Neural Architecture (DNA), a modular, fully distributed dynamic neural network. DNA employs an end-to-end learnable routing mechanism that lets each token adaptively select its execution path across modules, achieving content-driven sparse activation and load balancing. The resulting computation paths follow a power-law distribution, promoting functional specialization of modules and yielding interpretable, task-aware compute allocation. Built from Transformer-style modules (MLP, attention, etc.), DNA matches the performance of dense models at equivalent parameter counts on both vision and language tasks, while improving data-driven inference efficiency, parameter sharing, and architectural interpretability without compromising accuracy.
📝 Abstract
We introduce and train distributed neural architectures (DNA) in the vision and language domains. DNAs are initialized with a proto-architecture consisting of modules (transformer, MLP, attention, etc.) and routers. Any token (or patch) can traverse any series of modules in any order. DNAs are a natural generalization of sparse methods such as Mixture-of-Experts, Mixture-of-Depths, and parameter sharing. Computation and communication patterns of DNA modules are learnt end-to-end during training and depend on the content and context of each token (or patch). These patterns can be shaped by further requirements added to the optimization objective, such as compute/memory efficiency or load balancing. We empirically show that (i) trained DNAs are competitive with dense baselines in both domains and (ii) compute efficiency and parameter sharing can be learnt from data. Next, we analyze the emergent connectivity and computation patterns in trained DNAs. We find that the paths tokens take through the models are themselves distributed according to a power law. We show that some paths (or, equivalently, groups of modules) exhibit emergent specialization. Finally, we demonstrate that models learn to allocate compute and active parameters in an interpretable way.
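To make the routing idea concrete, here is a minimal numpy sketch of content-dependent token routing under toy assumptions: each "module" is a small linear map standing in for an MLP or attention block, and a router scores modules from each token's current representation and makes a hard top-1 choice per hop. The names (`route_tokens`, `N_STEPS`) and the hard argmax routing are illustrative assumptions, not the paper's actual mechanism, which is learnt end-to-end with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # token embedding dimension (toy value)
N_MODULES = 4  # number of modules in the proto-architecture
N_STEPS = 3    # number of routing hops each token takes

# Toy "modules": independent linear maps standing in for MLP/attention blocks.
modules = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_MODULES)]
# Router weights: score each module from the token's current representation.
router_w = rng.standard_normal((D, N_MODULES)) / np.sqrt(D)

def route_tokens(tokens):
    """Send each token through N_STEPS modules chosen by the router.

    Returns the final representations and the path (sequence of module
    indices) each token took, so paths can later be inspected for
    specialization or path-frequency statistics.
    """
    x = tokens
    paths = []
    for _ in range(N_STEPS):
        scores = x @ router_w            # (n_tokens, N_MODULES) routing logits
        choice = scores.argmax(axis=-1)  # hard top-1 module choice per token
        x = np.stack([x[i] @ modules[c] for i, c in enumerate(choice)])
        paths.append(choice)
    return x, np.stack(paths, axis=1)    # paths: (n_tokens, N_STEPS)

tokens = rng.standard_normal((5, D))
out, paths = route_tokens(tokens)
print(out.shape)    # (5, 8)
print(paths.shape)  # (5, 3): one module index per token per hop
```

Because different tokens select different module sequences, activation is sparse per token, and tallying the rows of `paths` over a corpus is exactly the kind of path-distribution analysis the abstract describes.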