WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM training optimization methods primarily focus on architectural modifications or optimizer enhancements, neglecting systematic regulation of weight patterns—i.e., parameter distributions and relative magnitudes. Method: This paper proposes WISCA, the first approach to dynamically optimize weight patterns during training via provably equivalent weight rescaling—without altering model architecture, optimizer, or output layer. WISCA is inherently compatible with mainstream paradigms including Grouped-Query Attention (GQA) and Low-Rank Adaptation (LoRA). Contribution/Results: Experiments across multiple LLMs demonstrate that WISCA consistently improves zero-shot validation accuracy by an average of 5.6%, reduces training perplexity by 2.12%, and significantly stabilizes training dynamics while enhancing generalization. By decoupling weight pattern optimization from structural or algorithmic changes, WISCA establishes a novel, efficient paradigm for large-model training.

📝 Abstract
The Transformer architecture has come to dominate the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) have primarily focused on architectural modifications or optimizer adjustments. However, these approaches lack systematic optimization of weight patterns during training, where a weight pattern refers to the distribution and relative magnitudes of weight parameters in a neural network. To address this gap, we propose a Weight Scaling method called WISCA that enhances training efficiency and model quality by strategically improving neural network weight patterns without changing the network structure. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly in LLMs with Grouped Query Attention (GQA) architectures and in LoRA fine-tuning tasks. Empirical results show a 5.6% average improvement on zero-shot validation tasks and a 2.12% average reduction in training perplexity across multiple architectures.
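The output-preserving rescaling the abstract describes can be illustrated on a toy two-layer network (an illustrative sketch only, not the paper's actual WISCA schedule; all shapes and the factor `alpha` here are arbitrary choices). Because ReLU is positively homogeneous, multiplying the first layer's weights and bias by a factor α > 0 and dividing the next layer's weights by α leaves the output unchanged, even though the parameter magnitudes, i.e. the "weight pattern", change:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)
x = rng.standard_normal(4)

relu = lambda z: np.maximum(z, 0.0)
y = W2 @ relu(W1 @ x + b1) + b2

# Rescale by alpha > 0: relu(alpha * z) == alpha * relu(z),
# so the 1/alpha folded into W2 cancels the alpha exactly.
alpha = 3.0
y_scaled = (W2 / alpha) @ relu((alpha * W1) @ x + alpha * b1) + b2

assert np.allclose(y, y_scaled)  # outputs are identical
```

The equivalence holds for any positively homogeneous activation; only the relative magnitudes of the two layers' parameters (and hence the subsequent gradient dynamics) differ between the two parameterizations.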
Problem

Research questions and friction points this paper is trying to address.

Optimizing weight patterns in Transformer LLMs during training
Improving training efficiency without architectural changes
Enhancing convergence quality and generalization in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weight scaling optimizes training without structural changes
Rescaling preserves outputs while improving convergence quality
Enhances efficiency in GQA architectures and LoRA fine-tuning
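The LoRA compatibility claimed above can be sketched in the same spirit (a minimal hypothetical example, not the paper's implementation): a LoRA update is a low-rank product B @ A, so scaling one factor by α and the other by 1/α leaves the effective weight update unchanged while redistributing magnitude between the two trainable factors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical LoRA update: W_eff = W0 + B @ A, with rank r = 2
A = rng.standard_normal((2, 6))   # down-projection factor
B = rng.standard_normal((6, 2))   # up-projection factor

alpha = 0.5
# Rescaling the factor pair leaves the product unchanged,
# but shifts magnitude from B to A (or vice versa).
delta = B @ A
delta_scaled = (B / alpha) @ (alpha * A)

assert np.allclose(delta, delta_scaled)  # same effective update
```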
👥 Authors

Jiacheng Li
Meituan, Beijing, China

Jianchao Tan
Meituan

Zhidong Yang
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Pingwei Sun
Meituan, Beijing, China

Feiye Huo
Meituan, Beijing, China

Jiayu Qin
University at Buffalo

Yerui Sun
Meituan, Beijing, China

Yuchen Xie
Meituan, Beijing, China

Xunliang Cai
Meituan, Beijing, China

Xiangyu Zhang
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Maoxin He
Xiamen University, Xiamen, China

Guangming Tan
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Weile Jia
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Tong Zhao
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China