🤖 AI Summary
This paper addresses the high parameter redundancy and substantial memory overhead in large language model (LLM) training. It introduces rate-distortion optimization (RDO) into the training process for the first time, proposing an end-to-end dynamic compression framework. Methodologically, it integrates Lagrangian multiplier-based rate constraint control, gradient reweighting, and structure-aware sparsification to enable controllable trade-offs between accuracy and parameter complexity. Key contributions include: (1) suppressing redundancy from the training outset, enabling robust pruning at high ratios; (2) achieving zero accuracy loss under 80% parameter pruning, with 60–90% memory reduction; and (3) significantly outperforming post-training compression methods in accuracy, generalization, robustness, and edge deployment efficiency.
📝 Abstract
The rapid advancement of large language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (Backslash), a novel training-time compression approach based on rate-distortion optimization (RDO). Backslash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments across various architectures and tasks demonstrate that Backslash can reduce memory usage by 60%–90% without accuracy loss and provides significant compression gains compared to post-training compression. Moreover, Backslash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80% pruning rates), and enables network simplification for accelerated inference on edge devices.
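The Lagrange-multiplier trade-off described above can be sketched as a penalized training loss of the form J = D + λ·R. Note this is only a minimal illustration: the paper does not specify its rate model here, so the L1-norm rate proxy and the toy values below are assumptions chosen because L1 penalties encourage the sparsity that later enables aggressive pruning.

```python
# Minimal sketch of a rate-distortion optimization (RDO) training objective:
# J = D + lambda * R, where D is the task (distortion) loss and R is a
# parameter-rate penalty. The L1 proxy for R is an assumption for illustration,
# not the actual rate model used by Backslash.

def rdo_loss(task_loss, params, lam):
    """Return the rate-constrained objective J = D + lambda * R."""
    rate = sum(abs(w) for w in params)  # L1 proxy for parameter "rate" (assumption)
    return task_loss + lam * rate

# Toy usage: a larger Lagrange multiplier shifts the optimum toward
# lower-rate (sparser, more compressible) parameters at some cost in accuracy.
params = [0.5, -1.2, 0.0, 3.0]
low_pressure = rdo_loss(2.0, params, lam=0.01)
high_pressure = rdo_loss(2.0, params, lam=0.5)
```

A small λ behaves like a mild regularizer (the generalization benefit mentioned above), while a large λ pushes many weights toward zero, making high pruning ratios survivable.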