🤖 AI Summary
Traditional neural networks rely on fixed computation paths, which limits their ability to achieve efficiency and flexibility simultaneously. To address this, we propose the Distributed Neural Architecture (DNA), a modular, fully distributed dynamic neural network. DNA employs an end-to-end learnable routing mechanism that lets each token adaptively select its execution path across modules, achieving content-driven sparse activation and load balancing. The resulting computation paths follow a power-law distribution, promoting functional specialization of modules and yielding interpretable, task-aware compute allocation. Built from Transformer-style modules (MLP, attention, etc.), DNA matches the performance of dense models at equivalent parameter counts on both vision and language tasks, while improving data-driven inference efficiency, parameter sharing, and architectural interpretability without compromising accuracy.
📝 Abstract
We introduce and train distributed neural architectures (DNA) in the vision and language domains. DNAs are initialized with a proto-architecture consisting of modules (transformer, MLP, attention, etc.) and routers. Any token (or patch) can traverse any series of modules in any order. DNAs are a natural generalization of sparse methods such as Mixture-of-Experts, Mixture-of-Depths, and parameter sharing. Computation and communication patterns of DNA modules are learnt end-to-end during training and depend on the content and context of each token (or patch). These patterns can be shaped by further requirements added to the optimization objective, such as compute/memory efficiency or load balancing. We empirically show that (i) trained DNAs are competitive with dense baselines in both domains and (ii) compute efficiency and parameter sharing can be learnt from data. Next, we analyze the emergent connectivity and computation patterns in trained DNAs. We find that the paths tokens take through the models are themselves distributed according to a power law. We show that some paths (or, equivalently, groups of modules) exhibit emergent specialization. Finally, we demonstrate that models learn to allocate compute and active parameters in an interpretable way.
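To make the routing idea concrete, here is a minimal numpy sketch of content-dependent token routing under toy assumptions: each "module" is a small linear map standing in for an MLP or attention block, and a router scores modules from each token's current representation and makes a hard top-1 choice per hop. The names (`route_tokens`, `N_STEPS`) and the hard argmax routing are illustrative assumptions, not the paper's actual mechanism, which is learnt end-to-end with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # token embedding dimension (toy value)
N_MODULES = 4  # number of modules in the proto-architecture
N_STEPS = 3    # number of routing hops each token takes

# Toy "modules": independent linear maps standing in for MLP/attention blocks.
modules = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_MODULES)]
# Router weights: score each module from the token's current representation.
router_w = rng.standard_normal((D, N_MODULES)) / np.sqrt(D)

def route_tokens(tokens):
    """Send each token through N_STEPS modules chosen by the router.

    Returns the final representations and the path (sequence of module
    indices) each token took, so paths can later be inspected for
    specialization or path-frequency statistics.
    """
    x = tokens
    paths = []
    for _ in range(N_STEPS):
        scores = x @ router_w            # (n_tokens, N_MODULES) routing logits
        choice = scores.argmax(axis=-1)  # hard top-1 module choice per token
        x = np.stack([x[i] @ modules[c] for i, c in enumerate(choice)])
        paths.append(choice)
    return x, np.stack(paths, axis=1)    # paths: (n_tokens, N_STEPS)

tokens = rng.standard_normal((5, D))
out, paths = route_tokens(tokens)
print(out.shape)    # (5, 8)
print(paths.shape)  # (5, 3): one module index per token per hop
```

Because different tokens select different module sequences, activation is sparse per token, and tallying the rows of `paths` over a corpus is exactly the kind of path-distribution analysis the abstract describes.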