Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning

📅 2026-04-21

🤖 AI Summary

Existing optimizers do not distinguish between the requirements of training from scratch and those of fine-tuning, and so struggle to balance convergence, generalization, and preservation of pre-trained knowledge. This work proposes DualOpt, an optimizer framework that decouples the optimization strategy according to the training paradigm. For training from scratch, it introduces real-time layer-wise weight decay. For fine-tuning, it embeds a weight rollback term in the optimizer update and adjusts the rollback strength per layer to suit the downstream task. The authors report that this mitigates knowledge forgetting during fine-tuning and achieves state-of-the-art performance on image classification, object detection, semantic segmentation, and instance segmentation.

📝 Abstract

With the accumulation of resources in the era of big data and the rise of pre-trained models in deep learning, optimizing neural networks for various tasks often involves different strategies for fine-tuning pre-trained models versus training from scratch. However, existing optimizers primarily focus on reducing the loss function by updating model parameters, without fully addressing the unique demands of these two major paradigms. In this paper, we propose DualOpt, a novel approach that decouples optimization techniques specifically tailored for these distinct training scenarios. For training from scratch, we introduce real-time layer-wise weight decay, designed to enhance both convergence and generalization by aligning with the characteristics of weight updates and network architecture. More importantly, for fine-tuning, we integrate weight rollback with the optimizer, incorporating a rollback term into each weight update step. This ensures consistency in the weight distribution between upstream and downstream models, effectively mitigating knowledge forgetting and improving fine-tuning performance. Additionally, we extend the layer-wise weight decay to dynamically adjust the rollback levels across layers, adapting to the varying demands of different downstream tasks. Extensive experiments across diverse tasks, including image classification, object detection, semantic segmentation, and instance segmentation, demonstrate the broad applicability and state-of-the-art performance of DualOpt. Code is available at https://github.com/qklee-lz/OLOR-AAAI-2024.
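
The two update rules described in the abstract can be illustrated with a small Python sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names, the power-law schedule in `layerwise_levels` (stronger for shallow layers, weaker for deep ones), and the exact form of both steps. The rollback term here is a generic pull toward the pre-trained weights added to a plain SGD step, and the decay term is ordinary decoupled weight decay; neither should be read as DualOpt's actual update.

```python
from typing import List


def layerwise_levels(num_layers: int, max_level: float,
                     min_level: float = 0.0, power: float = 2.0) -> List[float]:
    """Per-layer strength, from max_level at layer 0 down toward min_level.

    Hypothetical power-law schedule; the abstract does not specify one.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be >= 1")
    return [
        min_level + (1.0 - i / num_layers) ** power * (max_level - min_level)
        for i in range(num_layers)
    ]


def scratch_step(w: List[float], grad: List[float],
                 lr: float, decay: float) -> List[float]:
    """Training from scratch (assumed form): SGD step plus decoupled
    weight decay, which shrinks each weight toward zero."""
    return [wi - lr * gi - lr * decay * wi for wi, gi in zip(w, grad)]


def finetune_step(w: List[float], grad: List[float], w_pre: List[float],
                  lr: float, level: float) -> List[float]:
    """Fine-tuning (assumed form): SGD step plus a rollback term that
    pulls each weight toward its pre-trained value w_pre."""
    return [wi - lr * gi - level * (wi - pi)
            for wi, gi, pi in zip(w, grad, w_pre)]


# Usage: one strength per layer, applied layer by layer during fine-tuning.
levels = layerwise_levels(num_layers=4, max_level=0.1)
pretrained = [[1.0, -1.0] for _ in levels]   # toy pre-trained weights
current = [[2.0, 0.0] for _ in levels]       # toy weights after some drift
grads = [[0.0, 0.0] for _ in levels]         # zero gradient isolates rollback
updated = [
    finetune_step(w, g, p, lr=0.01, level=lv)
    for w, g, p, lv in zip(current, grads, pretrained, levels)
]
```

With a zero gradient, each layer moves back toward its pre-trained weights by a fraction equal to its level, so under this assumed schedule shallow layers stay closer to the upstream model than deep ones. Setting `level` to zero recovers plain SGD.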

Problem

Research questions and friction points this paper is trying to address.

neural network optimization

training from scratch

fine-tuning

knowledge forgetting

weight decay

Innovation

Methods, ideas, or system contributions that make the work stand out.

decoupled optimization

layer-wise weight decay

weight rollback