🤖 AI Summary
Existing LLM training optimization methods primarily focus on architectural modifications or optimizer enhancements, neglecting systematic regulation of weight patterns, i.e., the distributions and relative magnitudes of weight parameters. Method: This paper proposes WISCA, the first approach to dynamically optimize weight patterns during training via provably output-equivalent weight rescaling, without altering the model architecture, the optimizer, or the model's outputs. WISCA is inherently compatible with mainstream paradigms including Grouped-Query Attention (GQA) and Low-Rank Adaptation (LoRA). Contribution/Results: Experiments across multiple LLMs demonstrate that WISCA improves zero-shot validation accuracy by an average of 5.6%, reduces training perplexity by an average of 2.12%, and noticeably stabilizes training dynamics while enhancing generalization. By decoupling weight-pattern optimization from structural or algorithmic changes, WISCA establishes a novel, efficient paradigm for large-model training.
📝 Abstract
The Transformer architecture has come to dominate the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) focus primarily on architectural modifications or optimizer adjustments; they lack systematic optimization of weight patterns during training, where a weight pattern refers to the distribution and relative magnitudes of a network's weight parameters. To address this gap, we propose a weight-scaling method called WISCA that enhances training efficiency and model quality by strategically improving neural network weight patterns without changing the network structure. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly in LLMs with Grouped-Query Attention (GQA) architectures and in LoRA fine-tuning tasks. Empirical results show a 5.6% average improvement on zero-shot validation tasks and a 2.12% average reduction in training perplexity across multiple architectures.
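The output-preserving rescaling that the abstract describes can be illustrated with a minimal sketch. This is not the paper's actual WISCA algorithm, only the equivalence it builds on: for two consecutive linear layers separated by a positively homogeneous activation such as ReLU, multiplying the first layer's weights by any α > 0 and dividing the second layer's weights by α leaves the network's output unchanged, while changing the relative magnitudes (the "weight pattern") of the two layers.

```python
import numpy as np


def relu(x):
    # ReLU is positively homogeneous: relu(a * x) == a * relu(x) for a > 0,
    # which is what makes the rescaling below output-preserving.
    return np.maximum(x, 0.0)


rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # first linear layer (illustrative sizes)
W2 = rng.normal(size=(3, 8))   # second linear layer
x = rng.normal(size=4)         # an arbitrary input

alpha = 2.5  # any positive rescaling factor

# Original two-layer network: W2 @ relu(W1 @ x)
out_orig = W2 @ relu(W1 @ x)

# Rescaled network: first layer scaled up by alpha, second scaled down.
out_rescaled = (W2 / alpha) @ relu((alpha * W1) @ x)

# The outputs match, yet the parameter magnitudes differ, so the two
# networks sit at different points in weight space and can follow
# different training trajectories from here on.
assert np.allclose(out_orig, out_rescaled)
```

Because the two parameterizations are functionally identical but occupy different points in weight space, their gradients and subsequent optimization trajectories differ, which is the lever such a rescaling method can exploit.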