🤖 AI Summary
This work addresses the training instability and hyperparameter sensitivity of local learning algorithms, such as predictive coding (PC) and target propagation (TP), in deep networks. We conduct the first systematic analysis of these methods in the infinite-width limit and introduce the maximal update parameterization (μP) to enable hyperparameter transfer across network widths. Theoretically, we show that PC's gradients interpolate between first-order and Gauss–Newton-like gradients depending on the parameterization, while TP's local loss optimization inherently favors feature learning over kernel dynamics. Through a rigorous analysis of deep linear networks, we prove that μP ensures training stability and quantitatively characterize the gradient structure and dynamical regimes of PC and TP. Our work establishes a stable, scalable theoretical framework and a principled parameterization guideline for local learning in deep neural networks.
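To make the layer-wise local losses behind PC concrete, here is a minimal scalar sketch of predictive coding on a two-layer deep linear network. All values (the training pair, initial weights, and the two step sizes) are illustrative assumptions, not the paper's setting: PC first relaxes the latent activity on an energy of stacked prediction errors, then applies purely local weight updates.

```python
# Predictive coding (PC) on a scalar two-layer linear network: a minimal
# sketch, not the paper's exact algorithm. The energy is
#   E = 1/2*(z1 - w1*x)**2 + 1/2*(y - w2*z1)**2
# Inference relaxes the latent z1 on E; learning updates each weight
# using only its layer-local prediction error.
x, y = 1.0, 2.0          # single training pair, target y = 2*x (assumed)
w1, w2 = 0.5, 0.5        # initial weights (assumed)
lr_z, lr_w = 0.1, 0.05   # inference and learning rates (assumed)

losses = []
for epoch in range(200):
    z1 = w1 * x                        # initialize latent at the forward pass
    for _ in range(20):                # inference phase: relax z1 on E
        e1 = z1 - w1 * x               # bottom-layer prediction error
        e2 = y - w2 * z1               # top-layer prediction error
        z1 -= lr_z * (e1 - w2 * e2)    # gradient of E w.r.t. z1
    e1, e2 = z1 - w1 * x, y - w2 * z1
    w1 += lr_w * e1 * x                # local update: -dE/dw1
    w2 += lr_w * e2 * z1               # local update: -dE/dw2
    losses.append(0.5 * (y - w2 * w1 * x) ** 2)

print(losses[0], losses[-1])           # loss shrinks as w2*w1 approaches 2
```

Note that each weight update touches only the errors adjacent to that layer; no error signal is backpropagated through the whole network, which is the locality property the abstract refers to.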
📝 Abstract
Local learning, which trains a network through layer-wise local targets and losses, has been studied as an alternative to backpropagation (BP) in neural computation. However, its algorithms often become more complex or require additional hyperparameters due to their locality, making it challenging to identify settings in which training proceeds stably. To provide theoretical and quantitative insight, we introduce the maximal update parameterization ($\mu$P) in the infinite-width limit for two representative designs of local targets: predictive coding (PC) and target propagation (TP). We verified that $\mu$P enables hyperparameter transfer across models of different widths. Furthermore, our analysis revealed unique and intriguing properties of $\mu$P that are absent in conventional BP. By analyzing deep linear networks, we found that PC's gradients interpolate between first-order and Gauss–Newton-like gradients, depending on the parameterization. We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more like a first-order gradient method. For TP, even with the standard scaling of the last layer, which differs from classical $\mu$P, its local loss optimization favors the feature-learning regime over the kernel regime.
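The width-scaling idea behind $\mu$P can be illustrated with a toy readout comparison (a minimal sketch under assumed scalings, not the paper's full parameterization). One well-known signature of $\mu$P is that the readout layer carries a $1/\mathrm{width}$ multiplier instead of the standard $1/\sqrt{\mathrm{width}}$ scaling, so the network output at initialization shrinks as width grows while hidden features stay $O(1)$:

```python
import math
import random

def readout(width, mup, seed=0):
    """Forward pass of a one-hidden-layer linear net at initialization.
    Hidden weights use 1/sqrt(fan_in) init in both parameterizations;
    only the readout scale differs (assumed minimal scalings):
      standard: divide by sqrt(width);  muP: divide by width.
    """
    rng = random.Random(seed)
    x = [1.0] * 16                     # fixed toy input (assumed)
    d = len(x)
    # hidden pre-activations, O(1) entries under 1/sqrt(fan_in) init
    h = [sum(rng.gauss(0, 1 / math.sqrt(d)) * xi for xi in x)
         for _ in range(width)]
    out = sum(rng.gauss(0, 1) * hi for hi in h)
    return out / width if mup else out / math.sqrt(width)

for width in (256, 4096):
    print(width, abs(readout(width, mup=True)), abs(readout(width, mup=False)))
# With the same seed, the muP output equals the standard output shrunk by
# a further 1/sqrt(width): logits at init vanish as width grows.
```

This vanishing-output behavior at initialization is one ingredient of why $\mu$P keeps update sizes width-independent, which in turn is what makes hyperparameters transfer across widths.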