🤖 AI Summary
This work addresses catastrophic forgetting in class-incremental continual fine-tuning of large-scale pretrained vision models, a failure mode the authors link to the standard cross-entropy loss. They reformulate classification as a single-step Markov decision process and, for the first time, directly optimize the 0-1 loss from a reinforcement learning perspective via Expected Policy Gradient (EPG), applied in a parameter-efficient fine-tuning setting. They further introduce an adaptive entropy annealing mechanism (aEPG) that dynamically modulates the entropy of the predictive distribution during training, yielding a smooth transition from exploration to exploitation. A theoretical analysis reveals an intrinsic connection between cross-entropy minimization and policy gradient methods. Extensive experiments across multiple benchmarks and diverse parameter-efficient fine-tuning (PEFT) modules show that aEPG consistently outperforms cross-entropy baselines, confirming that low-entropy predictions are more conducive to continual adaptation.
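Under the one-step MDP view, with 0-1 reward r(a, y) = 1[a = y] and policy π(a|x) = softmax(z), the expected reward is simply p_y, so EPG reduces to maximizing p_y directly, while CE maximizes log p_y. A minimal NumPy sketch (our illustration; the variable names are not from the paper) makes the claimed sample-weighting relationship concrete: the EPG gradient equals the CE gradient scaled by the sample's confidence p_y, so CE (weight 1/p_y relative to EPG) emphasizes low-confidence samples and EPG high-confidence ones.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_ce(z, y):
    # d/dz of the CE loss -log p_y  =  p - onehot(y)
    g = softmax(z)
    g[y] -= 1.0
    return g

def grad_epg(z, y):
    # EPG minimizes the expected 0-1 loss 1 - p_y;
    # d/dz of -p_y  =  p_y * (p - onehot(y))
    p = softmax(z)
    g = p.copy()
    g[y] -= 1.0
    return p[y] * g

z = np.array([2.0, -1.0, 0.5])  # toy logits
y = 0
# CE is EPG with an extra per-sample weight 1 / p_y:
assert np.allclose(softmax(z)[y] * grad_ce(z, y), grad_epg(z, y))
```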
📝 Abstract
Despite their success, large pretrained vision models remain vulnerable to catastrophic forgetting when adapted to new tasks in class-incremental settings. Parameter-efficient fine-tuning (PEFT) alleviates this by restricting trainable parameters, yet most approaches still rely on cross-entropy (CE) loss, a surrogate for the 0-1 loss, to learn from new data. We revisit this choice and revive the true objective (0-1 loss) through a reinforcement learning perspective. By formulating classification as a one-step Markov Decision Process, we derive an Expected Policy Gradient (EPG) method that directly minimizes misclassification error with a low-variance gradient estimate. Our analysis shows that CE can be interpreted as EPG with an additional sample-weighting mechanism: CE encourages exploration by emphasizing low-confidence samples, while EPG prioritizes high-confidence ones. Building on this insight, we propose adaptive entropy annealing (aEPG), a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. aEPG-based methods outperform CE-based methods across diverse benchmarks and with various PEFT modules. More broadly, we evaluate various entropy regularization methods and demonstrate that lower entropy of the output prediction distribution enhances adaptation in pretrained vision models.
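The abstract does not spell out how the CE-to-EPG transition is realized, so the following is only one simple way to interpolate between the two objectives, not the authors' aEPG schedule: the generalized loss L_q(p_y) = (1 - p_y**q) / q recovers CE (-log p_y) in the limit q → 0 and the expected 0-1 loss (1 - p_y) that EPG optimizes at q = 1, so annealing q from near 0 to 1 moves training from exploratory to exploitative learning.

```python
import numpy as np

def generalized_loss(p_y, q):
    """One smooth bridge between CE and EPG (illustrative only;
    the paper's aEPG anneals adaptively, not with this fixed form).
    q -> 0 recovers cross-entropy -log(p_y); q = 1 gives the
    expected 0-1 loss 1 - p_y."""
    if q == 0.0:
        return -np.log(p_y)
    return (1.0 - p_y ** q) / q

p_y = 0.3
assert np.isclose(generalized_loss(p_y, 1e-8), -np.log(p_y), atol=1e-6)
assert np.isclose(generalized_loss(p_y, 1.0), 1.0 - p_y)

# a toy linear annealing schedule: CE-like early, EPG-like late
T = 100
schedule = [min(1.0, t / T) for t in range(T + 1)]
```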