🤖 AI Summary
Large-scale deep models incur high memory overhead during inference due to their massive parameter counts. Existing parameter-sharing methods rely on heuristic, adjacent-layer designs and lack systematic scalability across multiple layers. This work pioneers a graph coloring formulation for cross-layer parameter sharing, leveraging structural symmetry in the model's parameter space for rigorous, system-level modeling. From a group-theoretic perspective, we analyze sharing mechanisms and introduce an analytical criterion grounded in second-order gradient geometry, which guides parameter projection onto low-curvature subspaces. By combining Hessian spectral analysis with Taylor expansion, we formulate parameter grouping as the search for an optimal coloring function α: L → C. Evaluated across diverse architectures and tasks, our method consistently outperforms state-of-the-art approaches, achieving superior accuracy at higher compression ratios and demonstrating both theoretical rigor and engineering scalability.
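The coloring-function idea above can be illustrated with a minimal sketch: layers mapped to the same color share one underlying weight tensor, so the number of distinct tensors equals the number of sharing classes. The function and variable names here are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def build_shared_params(num_layers, alpha, shape, rng=None):
    """Allocate one weight tensor per color; map each layer to its color's tensor.

    alpha is a coloring function L -> C given as a dict {layer_index: color}.
    """
    rng = rng or np.random.default_rng(0)
    colors = sorted(set(alpha.values()))
    bank = {c: rng.standard_normal(shape) for c in colors}  # one tensor per sharing class
    return [bank[alpha[l]] for l in range(num_layers)]

# Example: 6 layers, 3 sharing classes -> distinct parameter tensors shrink 2x.
alpha = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
layers = build_shared_params(6, alpha, shape=(4, 4))
assert layers[0] is layers[1]      # same color -> same underlying tensor
assert layers[0] is not layers[2]  # different colors -> distinct tensors
```

The search problem the paper addresses is choosing `alpha` itself: the number of valid colorings grows combinatorially with depth, which is why an analytic selection criterion is needed rather than exhaustive search.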
📝 Abstract
Modern deep models have massive parameter counts, leading to high inference-time memory usage that limits practical deployment. Parameter sharing, a form of structured compression, effectively reduces redundancy, but existing approaches remain heuristic: they are restricted to adjacent layers and lack a systematic analysis of cross-layer sharing. Moreover, extending sharing across multiple layers leads to an exponentially expanding configuration space, making exhaustive search computationally infeasible and forming a critical bottleneck for parameter sharing. We recast parameter sharing from a group-theoretic perspective as introducing structural symmetries in the model's parameter space. A sharing configuration can be described by a coloring function $\alpha: L \rightarrow C$ (with $L$ the layer indices and $C$ the sharing classes), which determines inter-layer sharing groups while preserving structural symmetry. To determine the coloring function, we propose a second-order geometric criterion based on Taylor expansion and the Hessian spectrum. By projecting perturbations onto the Hessian's low-curvature eigensubspace, the criterion provides an analytic rule for selecting sharing groups that minimize performance impact, yielding a principled and scalable configuration procedure. Across diverse architectures and tasks, Geo-Sharing consistently outperforms state-of-the-art heuristic sharing strategies, achieving higher compression ratios with smaller accuracy degradation.
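The second-order criterion can be sketched with a toy example: under the Taylor approximation, a sharing-induced weight perturbation $\delta$ changes the loss by roughly $\tfrac{1}{2}\,\delta^\top H\,\delta$, so restricting $\delta$ to the Hessian's low-curvature eigensubspace keeps the predicted loss change small. This is a minimal illustration under assumed names, not the paper's implementation.

```python
import numpy as np

def curvature_score(H, delta):
    """Second-order Taylor estimate of the loss change caused by perturbation delta."""
    return 0.5 * delta @ H @ delta

def project_low_curvature(H, delta, k):
    """Keep only delta's components along the k smallest-eigenvalue directions of H."""
    eigvals, eigvecs = np.linalg.eigh(H)  # eigenvalues in ascending order
    U = eigvecs[:, :k]                    # basis of the low-curvature eigensubspace
    return U @ (U.T @ delta)

# Toy Hessian with one stiff direction and two flat ones.
H = np.diag([100.0, 1.0, 0.1])
delta = np.ones(3)
raw = curvature_score(H, delta)                                 # 50.55
flat = curvature_score(H, project_low_curvature(H, delta, 2))   # 0.55
assert flat < raw  # confining the perturbation to flat directions shrinks the predicted loss change
```

In this picture, candidate sharing groups are ranked by such curvature scores: groups whose merge perturbation lies mostly in flat directions of the Hessian are predicted to cost the least accuracy.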