🤖 AI Summary
To address the instability and data inefficiency of sequentially fine-tuning Transformers on streaming data under distribution shift, this paper proposes an uncertainty-aware Bayesian sequential learning framework. Fine-tuning is formulated as posterior inference that combines Kalman filtering with closed-form moment propagation. A Taylor approximation of the softmax layer gives differentiable moment estimates, and the pretrained weights serve as an explicit prior, which enables efficient online updates and quantization-friendly deployment. The key contribution is the first adaptation of Kalman Bayesian Neural Networks to sequential adaptation of Transformers, targeting robustness, low latency, and memory efficiency together. Evaluated on Decision Transformer tasks, the approach significantly improves generalization under distribution shift, converges stably from only a few new samples, and shows high data efficiency and strong uncertainty calibration.
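The softmax moment estimation mentioned in the summary can be illustrated with a generic Taylor expansion. The sketch below is not the paper's formulation, which the summary does not spell out. It assumes Gaussian logits with independent components (mean `mu`, per-component variance `var`), uses a second-order expansion for the output mean and a first-order (Jacobian) expansion for the output variance, and the function names `softmax` and `softmax_moments` are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def softmax_moments(mu, var):
    """Approximate mean and variance of softmax(z) for z ~ N(mu, diag(var)).

    Assumption: independent Gaussian logits. The mean uses a second-order
    Taylor expansion around mu; the variance uses the first-order Jacobian.
    """
    s = softmax(mu)
    n = len(mu)
    mean, variance = [], []
    for i in range(n):
        curvature = 0.0  # 0.5 * sum_k var_k * d^2 s_i / d z_k^2
        linear = 0.0     # sum_k (d s_i / d z_k)^2 * var_k
        for k in range(n):
            if k == i:
                jac = s[i] * (1.0 - s[i])
                hess = s[i] * (1.0 - s[i]) * (1.0 - 2.0 * s[i])
            else:
                jac = -s[i] * s[k]
                hess = s[i] * s[k] * (2.0 * s[k] - 1.0)
            curvature += 0.5 * hess * var[k]
            linear += jac * jac * var[k]
        mean.append(s[i] + curvature)
        variance.append(linear)
    return mean, variance


if __name__ == "__main__":
    m, v = softmax_moments([1.0, -0.5, 0.3], [0.2, 0.1, 0.4])
    print("approximate mean:", m)
    print("approximate variance:", v)
```

Because the second-derivative terms cancel when summed over the outputs, the approximate means still sum to one. For large input variances the expansion can leave the interval [0, 1], so a sketch like this is only reasonable when the logit uncertainty is moderate.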
📝 Abstract
Sequential fine-tuning of transformers is useful when new data arrive over time, especially with shifting distributions. Unlike batch learning, sequential learning demands that training be stabilized despite a small amount of data by balancing new information against previously learned knowledge in the pre-trained models. This challenge is further complicated when training must be completed in latency-critical environments and learning must both quantify uncertainty and be mediated by it. Motivated by these challenges, we propose a novel method that frames sequential fine-tuning as a posterior inference problem within a Bayesian framework. Our approach integrates closed-form moment propagation of random variables, Kalman Bayesian Neural Networks, and Taylor approximations of the moments of softmax functions. By explicitly accounting for pre-trained models as priors and adaptively balancing them against new information based on quantified uncertainty, our method achieves robust and data-efficient sequential learning. The effectiveness of our method is demonstrated through numerical simulations involving sequential adaptation of a decision transformer to tasks characterized by distribution shifts and limited memory resources.
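The idea of treating pre-trained weights as a prior and weighing them against new data by uncertainty can be sketched with a textbook Kalman measurement update. The example below is a minimal illustration under strong assumptions, not the paper's algorithm: a single linear unit with a scalar output `y = w · x + noise`, a Gaussian belief over `w` whose mean starts at the pre-trained weights, and a known observation noise variance. The function name `kalman_update` and all numbers are hypothetical.

```python
def kalman_update(mean, cov, x, y, noise_var):
    """One Kalman measurement update for y = w . x + noise, w ~ N(mean, cov).

    Assumptions: scalar observation, symmetric covariance `cov` given as a
    list of lists, and known observation noise variance `noise_var`.
    """
    n = len(mean)
    # P x (covariance times input)
    px = [sum(cov[i][j] * x[j] for j in range(n)) for i in range(n)]
    # Innovation variance S = x^T P x + r
    s = sum(x[i] * px[i] for i in range(n)) + noise_var
    # Kalman gain K = P x / S
    gain = [px[i] / s for i in range(n)]
    # Innovation: observed output minus predicted output
    innovation = y - sum(mean[i] * x[i] for i in range(n))
    new_mean = [mean[i] + gain[i] * innovation for i in range(n)]
    # P' = P - K (P x)^T, valid because P is symmetric
    new_cov = [[cov[i][j] - gain[i] * px[j] for j in range(n)]
               for i in range(n)]
    return new_mean, new_cov


if __name__ == "__main__":
    pretrained = [1.0, -0.5]             # hypothetical pre-trained weights
    prior_cov = [[0.1, 0.0], [0.0, 0.1]]  # hypothetical prior uncertainty
    stream = [([1.0, 0.0], 1.4), ([0.0, 1.0], -0.2), ([1.0, 1.0], 1.1)]

    mean, cov = pretrained, prior_cov
    for x, y in stream:                  # one sample at a time, no replay
        mean, cov = kalman_update(mean, cov, x, y, noise_var=0.05)
        print(mean, [cov[0][0], cov[1][1]])
```

The prior covariance controls the balance: a small prior variance keeps the weights close to the pre-trained values even when a new sample disagrees, while a large one lets a few samples move them substantially. Each update also shrinks the variance along the observed input direction, which is what lets later samples be weighted by the remaining uncertainty.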