Training the Untrainable: Introducing Inductive Bias via Representational Alignment

📅 2024-10-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditionally “untrainable” architectures—such as fully connected networks, residual-free CNNs, and vanilla RNNs—suffer from severe overfitting or underfitting due to insufficient inductive bias. To address this, we propose a representation alignment guidance mechanism that transfers architectural priors from high-performance “guide” networks (e.g., ResNet, Transformer) to target networks via differentiable neural distance functions, enforcing layer-wise representation alignment. Crucially, the guide network remains frozen while the target network is jointly optimized for both task loss and alignment loss. This work introduces the first quantifiable, differentiable framework for architectural prior transfer, providing a mathematically grounded, optimization-compatible tool for neural architecture design. Experiments demonstrate substantial improvements: fully connected networks achieve markedly enhanced visual generalization; plain CNNs approach ResNet-level performance; the performance gap between vanilla RNNs and Transformers narrows significantly; and remarkably, Transformers exhibit improved accuracy on RNN-favored tasks—a reverse enhancement effect.
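The mechanism described above can be sketched in code: a frozen guide network provides per-layer activations, and the target network is jointly optimized for the task loss plus a layer-wise alignment loss. The distance function below (a learnable linear projection per layer pair followed by MSE) is an illustrative stand-in for the paper's neural distance, not its exact form; the assumption that both networks expose lists of per-layer features is also hypothetical.

```python
import torch
import torch.nn as nn

class GuidanceLoss(nn.Module):
    """Layer-wise alignment loss against a frozen guide (illustrative sketch).

    Assumes target/guide feature dimensions may differ, so each aligned layer
    pair gets a learnable linear projection before the MSE distance.
    """
    def __init__(self, target_dims, guide_dims, weight=1.0):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Linear(t, g) for t, g in zip(target_dims, guide_dims)
        )
        self.weight = weight

    def forward(self, target_feats, guide_feats):
        loss = 0.0
        for proj, t_f, g_f in zip(self.projs, target_feats, guide_feats):
            # Guide activations are detached: the guide stays unchanged.
            loss = loss + nn.functional.mse_loss(proj(t_f), g_f.detach())
        return self.weight * loss

def train_step(target, guide, guidance, optimizer, x, y):
    """One joint optimization step: task loss + alignment loss.

    Assumes (hypothetically) that `guide(x)` returns a list of per-layer
    activations and `target(x)` returns (logits, per-layer activations).
    """
    guide.eval()
    with torch.no_grad():
        guide_feats = guide(x)
    logits, target_feats = target(x)
    loss = nn.functional.cross_entropy(logits, y)
    loss = loss + guidance(target_feats, guide_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the optimizer should cover both the target's parameters and the distance function's projection parameters, while the guide's parameters are excluded entirely.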

📝 Abstract
We demonstrate that architectures which traditionally are considered to be ill-suited for a task can be trained using inductive biases from another architecture. Networks are considered untrainable when they overfit, underfit, or converge to poor results even when tuning their hyperparameters. For example, plain fully connected networks overfit on object recognition while deep convolutional networks without residual connections underfit. The traditional answer is to change the architecture to impose some inductive bias, although what that bias is remains unknown. We introduce guidance, where a guide network guides a target network using a neural distance function. The target is optimized to perform well and to match its internal representations, layer-by-layer, to those of the guide; the guide is unchanged. If the guide is trained, this transfers over part of the architectural prior and knowledge of the guide to the target. If the guide is untrained, this transfers over only part of the architectural prior of the guide. In this manner, we can investigate what kinds of priors different architectures place on untrainable networks such as fully connected networks. We demonstrate that this method overcomes the immediate overfitting of fully connected networks on vision tasks, makes plain CNNs competitive to ResNets, closes much of the gap between plain vanilla RNNs and Transformers, and can even help Transformers learn tasks which RNNs can perform more easily. We also discover evidence that better initializations of fully connected networks likely exist to avoid overfitting. Our method provides a mathematical tool to investigate priors and architectures, and in the long term, may demystify the dark art of architecture creation, even perhaps turning architectures into a continuous optimizable parameter of the network.
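Matching internal representations "layer-by-layer", as the abstract describes, requires extracting intermediate activations from both networks. A common way to do this in PyTorch is with forward hooks; the snippet below is a minimal sketch of that extraction step (the layer names and model here are placeholders, not the paper's architectures).

```python
import torch
import torch.nn as nn

def capture_activations(model, layer_names):
    """Register forward hooks that record each named layer's output."""
    feats = {}
    handles = []
    for name, module in model.named_modules():
        if name in layer_names:
            # Bind `name` per hook so each records under its own key.
            def hook(mod, inp, out, name=name):
                feats[name] = out
            handles.append(module.register_forward_hook(hook))
    return feats, handles

# Toy model: nn.Sequential names its submodules "0", "1", "2", ...
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
feats, handles = capture_activations(model, {"0", "2"})
_ = model(torch.randn(2, 8))
# feats["0"] has shape (2, 16); feats["2"] has shape (2, 4)
for h in handles:
    h.remove()  # detach hooks once features are no longer needed
```

The same capture would be run on both the guide and the target in each forward pass, with the recorded activations fed to the alignment loss.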
Problem

Research questions and friction points this paper is trying to address.

Overcoming architectural limitations through representational alignment guidance
Transferring inductive biases between networks to improve training performance
Enabling traditionally unsuitable architectures to achieve better task results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transferring inductive bias from a frozen guide network to a target network
Layer-wise representational similarity, measured by a differentiable neural distance, as the alignment objective
Evidence that guidance-derived initializations of fully connected networks can avoid overfitting