🤖 AI Summary
This work investigates the fundamental differences in functional behavior between recurrent neural networks (RNNs) and deep neural networks (DNNs) when their architectures are structurally identical except for weight sharing, focusing on their distinct feature-learning mechanisms. By developing a unified mean-field theoretical framework, the authors model the representations of both architectures under μP parametrization as representation kernels and interpret the training dynamics as Bayesian inference over sequences and patterns. The study establishes, for the first time, a unified theory of RNNs and DNNs at the level of feature learning, revealing that weight sharing induces a phase transition in representational correlations and introduces a distinct inductive bias. Specifically, below a critical signal-strength threshold, both architectures behave identically; above it, RNNs develop correlated representations across time steps and exhibit superior generalization by interpolating unseen time steps in sequential tasks.
📄 Abstract
Recurrent and deep neural networks (RNNs/DNNs) are cornerstone architectures in machine learning. Remarkably, RNNs differ from DNNs only by weight sharing, as can be shown through unrolling in time. How does this structural similarity fit with the distinct functional properties these networks exhibit? To address this question, we here develop a unified mean-field theory for RNNs and DNNs in terms of representational kernels, describing fully trained networks in the feature learning ($\mu$P) regime. This theory casts training as Bayesian inference over sequences and patterns, directly revealing the functional implications induced by the RNNs' weight sharing. In DNN-typical tasks, we identify a phase transition when the learning signal overcomes the noise due to randomness in the weights: below this threshold, RNNs and DNNs behave identically; above it, only RNNs develop correlated representations across time steps. For sequential tasks, the RNNs' weight sharing furthermore induces an inductive bias that aids generalization by interpolating unsupervised time steps. Overall, our theory offers a way to connect architectural structure to functional biases.
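The abstract's starting point, that an RNN is exactly a DNN with tied weights once unrolled in time, can be made concrete with a minimal sketch (all names, dimensions, and the tanh nonlinearity here are illustrative choices, not taken from the paper): a DNN whose layers all reuse one weight matrix computes precisely the unrolled RNN map.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                                  # unroll depth and hidden width (illustrative)
W = rng.standard_normal((d, d)) / np.sqrt(d)  # shared recurrent weight matrix
x0 = rng.standard_normal(d)                   # initial hidden state / input

def dnn(x, layer_weights):
    """Depth-len(layer_weights) feedforward pass: one weight matrix per layer."""
    h = x
    for W_l in layer_weights:
        h = np.tanh(W_l @ h)
    return h

def rnn(x, W, T):
    """RNN unrolled for T time steps: the same weight matrix at every step."""
    h = x
    for _ in range(T):
        h = np.tanh(W @ h)
    return h

# Tying all DNN layers to the same W recovers the RNN; untied layers give a generic DNN.
assert np.allclose(dnn(x0, [W] * T), rnn(x0, W, T))
```

Replacing `[W] * T` with `T` independently drawn matrices yields the structurally identical but untied DNN that the paper's theory compares against.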