🤖 AI Summary
To address the slow convergence of gradient descent and the scalability limitations of existing teleportation methods, namely high computational overhead and incompatibility with CNNs and Transformers, this paper proposes a teleportation algorithm based on null-space gradient projection. Our core innovation is the first rigorous projection of gradients onto the input null space, which guarantees exact preservation of the loss value and enables efficient, architecture-agnostic navigation of the parameter space across MLPs, CNNs, and Transformers. The method integrates null-space projection, optimization on the loss-invariant manifold, and orthogonal gradient decomposition, and introduces a differentiable teleportation objective. Extensive experiments across multiple benchmark datasets and optimizers demonstrate that our approach significantly reduces computational cost while accelerating convergence and maintaining or improving final model accuracy.
📝 Abstract
Optimization techniques have become increasingly critical due to ever-growing model complexity and data scale. In particular, teleportation has emerged as a promising approach: it accelerates the convergence of gradient descent-based methods by navigating within the loss-invariant level set to identify parameters with advantageous geometric properties. Existing teleportation algorithms have primarily demonstrated their effectiveness on Multi-Layer Perceptrons (MLPs), but their extension to more advanced architectures, such as Convolutional Neural Networks (CNNs) and Transformers, remains challenging. Moreover, they often impose significant computational demands, limiting their applicability to complex architectures. To address these limitations, we introduce an algorithm that projects the gradient of the teleportation objective function onto the input null space, keeping the teleportation on the loss-invariant level set while reducing computational cost. Our approach generalizes readily from MLPs to CNNs, Transformers, and potentially other advanced architectures. We validate the effectiveness of our algorithm across various benchmark datasets and optimizers, demonstrating its broad applicability.
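The key idea, projecting an update onto the input null space so that the layer's outputs on the training batch (and hence the loss) are exactly preserved, can be sketched for a single linear layer. The following is a minimal illustrative toy, not the paper's implementation; the shapes, the projector construction, and the stand-in gradient `G` are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer y = W @ X, where X holds a batch of n inputs as columns.
d_in, d_out, n = 8, 4, 3            # n < d_in, so the input null space is nontrivial
X = rng.normal(size=(d_in, n))
W = rng.normal(size=(d_out, d_in))
G = rng.normal(size=(d_out, d_in))  # stand-in for the teleportation-objective gradient

# Projector onto the null space of X^T: P = I - X X^+ (X^+ = Moore-Penrose
# pseudoinverse). Any update dW whose rows lie in this subspace satisfies
# dW @ X = 0, so the layer's outputs on the batch are unchanged.
P = np.eye(d_in) - X @ np.linalg.pinv(X)
G_proj = G @ P                      # loss-preserving teleportation direction

W_new = W + 0.1 * G_proj            # teleport: move in parameter space ...
print(np.allclose(W_new @ X, W @ X))  # ... while outputs on the batch stay fixed
```

Because the projected step changes the weights without changing the batch outputs, the optimizer can move to a point on the same level set with a different local geometry, which is the mechanism the abstract describes.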