🤖 AI Summary
Existing LLM training optimization methods primarily focus on architectural modifications or optimizer enhancements, neglecting systematic regulation of weight patterns, i.e., the distributions and relative magnitudes of weight parameters. Method: This paper proposes WISCA, the first approach to dynamically optimize weight patterns during training via provably output-equivalent weight rescaling, without altering the model architecture, the optimizer, or the model's outputs. WISCA is inherently compatible with mainstream paradigms including Grouped-Query Attention (GQA) and Low-Rank Adaptation (LoRA). Contribution/Results: Experiments across multiple LLMs demonstrate that WISCA improves zero-shot validation accuracy by an average of 5.6%, reduces training perplexity by an average of 2.12%, and noticeably stabilizes training dynamics while enhancing generalization. By decoupling weight-pattern optimization from structural or algorithmic changes, WISCA establishes a novel, efficient paradigm for large-model training.
📝 Abstract
The Transformer architecture has come to dominate the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) focus primarily on architectural modifications or optimizer adjustments; they lack systematic optimization of weight patterns during training, where a weight pattern refers to the distribution and relative magnitudes of a network's weight parameters. To address this gap, we propose a weight-scaling method called WISCA that enhances training efficiency and model quality by strategically improving neural network weight patterns without changing the network structure. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly in LLMs with Grouped-Query Attention (GQA) architectures and in LoRA fine-tuning tasks. Empirical results show a 5.6% average improvement on zero-shot validation tasks and a 2.12% average reduction in training perplexity across multiple architectures.
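The output-preserving rescaling that the abstract describes can be illustrated with a minimal sketch. This is not the paper's actual WISCA algorithm, only the equivalence it builds on: for two consecutive linear layers separated by a positively homogeneous activation such as ReLU, multiplying the first layer's weights by any α > 0 and dividing the second layer's weights by α leaves the network's output unchanged, while changing the relative magnitudes (the "weight pattern") of the two layers.

```python
import numpy as np


def relu(x):
    # ReLU is positively homogeneous: relu(a * x) == a * relu(x) for a > 0,
    # which is what makes the rescaling below output-preserving.
    return np.maximum(x, 0.0)


rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # first linear layer (illustrative sizes)
W2 = rng.normal(size=(3, 8))   # second linear layer
x = rng.normal(size=4)         # an arbitrary input

alpha = 2.5  # any positive rescaling factor

# Original two-layer network: W2 @ relu(W1 @ x)
out_orig = W2 @ relu(W1 @ x)

# Rescaled network: first layer scaled up by alpha, second scaled down.
out_rescaled = (W2 / alpha) @ relu((alpha * W1) @ x)

# The outputs match, yet the parameter magnitudes differ, so the two
# networks sit at different points in weight space and can follow
# different training trajectories from here on.
assert np.allclose(out_orig, out_rescaled)
```

Because the two parameterizations are functionally identical but occupy different points in weight space, their gradients and subsequent optimization trajectories differ, which is the lever such a rescaling method can exploit.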