🤖 AI Summary
This work addresses the limited theoretical understanding of how residual networks distribute the approximation of a target function across their layers. The authors model a residual network as a layer-wise process that builds an approximation trajectory from input to target, and prove that progressive trajectories exist in which the error decreases monotonically with depth. They introduce Layer-wise Progressive Approximation (LPA), an architecture-agnostic training principle that explicitly aligns each layer with its residual target. This enables a “train once, use N models” paradigm, with progressive behavior observed in residual fully connected networks, ResNets, and Transformers. Experiments on complex surface fitting, image classification, and LLM-based generation and classification indicate that a single trained network yields useful predictions at every depth, supporting efficient shallow inference without retraining.
📝 Abstract
The Universal Approximation Theorem (UAT) guarantees universal function approximation but does not explain how residual models distribute approximation across layers. We reframe residual networks as a layer-wise approximation process that builds an approximation trajectory from input to target, and prove the existence of progressive trajectories along which error decreases monotonically with depth. This result shows that residual networks can implement structured, step-by-step refinement rather than an end-to-end (E2E) black-box mapping. Building on this, we propose Layer-wise Progressive Approximation (LPA), a theoretically grounded training principle that explicitly aligns each layer with its residual target to realize such trajectories. LPA is architecture-agnostic: we observe progressive behavior in residual FNNs, ResNets, and Transformers across tasks including complex surface fitting, image classification, and NLP with LLMs for generation and classification. Crucially, this enables ``train once, use $N$ models'': a single network yields useful predictions at every depth, supporting efficient shallow inference without retraining. Our work unifies approximation theory with practical deep learning, providing a new lens on representation learning and a flexible framework for multi-depth deployment. The source code will be released upon acceptance at https://(open\_upon\_acceptance).
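To make the idea of a progressive trajectory concrete, the following is a minimal NumPy sketch, not the authors' LPA implementation. It assumes a toy surface-fitting task, a running prediction that starts at zero, residual blocks built from fixed random `tanh` features, and a ridge-regression fit of each block to the current residual target `y - h`. The names `fit_progressive` and `predict` and the parameters `width`, `step`, and `ridge` are hypothetical. Under these assumptions, with `step` in (0, 1], the training error cannot increase from one block to the next, and truncating the stack at any depth gives a usable predictor.

```python
import numpy as np


def fit_progressive(x, y, depth=6, width=64, step=0.5, ridge=1e-3, seed=0):
    """Greedily fit residual blocks, each aligned with the residual target y - h.

    Illustrative assumption: every block uses fixed random tanh features and
    only its linear readout is fitted (by ridge regression).
    """
    rng = np.random.default_rng(seed)
    h = np.zeros_like(y)                      # trajectory starts at zero (assumption)
    layers = []
    errors = [float(np.mean((y - h) ** 2))]   # training MSE at depth 0
    for _ in range(depth):
        z = np.concatenate([x, h], axis=1)    # block sees the input and current state
        A = rng.normal(size=(z.shape[1], width))
        b = rng.normal(size=width)
        phi = np.tanh(z @ A + b)
        # Ridge regression of the random features onto the current residual target.
        W = np.linalg.solve(phi.T @ phi + ridge * np.eye(width), phi.T @ (y - h))
        Ws = step * W                         # damped readout, stored with the block
        h = h + phi @ Ws                      # residual update: h_{k+1} = h_k + f_k(x, h_k)
        layers.append((A, b, Ws))
        errors.append(float(np.mean((y - h) ** 2)))
    return layers, errors


def predict(layers, x, depth):
    """Run only the first `depth` blocks: one trained stack, many usable depths."""
    h = np.zeros((x.shape[0], layers[0][2].shape[1]))
    for A, b, Ws in layers[:depth]:
        z = np.concatenate([x, h], axis=1)
        h = h + np.tanh(z @ A + b) @ Ws
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=(512, 2))
    y = np.sin(3 * x[:, :1]) * np.cos(2 * x[:, 1:])   # toy target surface
    layers, errors = fit_progressive(x, y)
    for k, e in enumerate(errors):
        print(f"depth {k}: train MSE {e:.5f}")
```

Calling `predict(layers, x, depth=k)` for any `k` up to the trained depth reuses the same parameters, which is the sense in which one training run yields several models of different depth. The monotonicity in this sketch holds for the training data only and follows from the damped least-squares update; it says nothing about the guarantees or empirical results reported in the paper.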