Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard Transformers treat every token with uniform confidence, so they have no principled way to handle the uncertainty found in real applications, such as cold-start conditions, heterogeneous signal quality, and attention sinks. This work proposes the Bayesian Filtering Transformer (BFT), which integrates uncertainty modeling into the Transformer architecture. Specifically, BFT reformulates attention as precision-weighted kriging interpolation, interprets residual connections as Kalman updates with adaptive gain, models the feed-forward network (FFN) as a dynamics model that propagates precision via a Jacobian-plus-process-noise rule, and uses a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior to estimate observation precision. By tracking and propagating uncertainty throughout the model, BFT yields significant gains on six sequential recommendation benchmarks, with the largest improvements on cold-start users and rare items, and improves robustness when fine-tuning large language models under noisy supervision and noisy context.
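To make the Kalman reading of the residual connection and the precision propagation through the FFN more concrete, here is a minimal scalar sketch. It is not the paper's implementation: the function names, the choice to treat the sublayer output as the innovation, and the scalar (rather than matrix) form are all assumptions made for illustration, using only the textbook Kalman gain and the standard linearized variance rule.

```python
def kalman_residual_update(x, prior_precision, innovation, obs_precision):
    """Scalar residual update with an adaptive gain (illustrative assumption).

    The gain is the textbook Kalman gain written in terms of precisions:
    gain = obs_precision / (prior_precision + obs_precision).
    A low prior precision (e.g. a cold-start token) pushes the gain toward 1,
    so the state leans more heavily on the new evidence.
    """
    gain = obs_precision / (prior_precision + obs_precision)
    x_new = x + gain * innovation
    # Precisions add when two independent Gaussian sources are fused.
    posterior_precision = prior_precision + obs_precision
    return x_new, posterior_precision, gain


def propagate_precision(precision, jacobian, process_noise_var):
    """Push a scalar precision through a locally linear map (illustrative assumption).

    Linearized variance rule: var_out = jacobian**2 * var_in + process_noise_var,
    where var_in = 1 / precision.
    """
    var_out = jacobian ** 2 / precision + process_noise_var
    return 1.0 / var_out


# Hypothetical numbers: a fairly uncertain state meets a more confident observation.
x_new, post_prec, gain = kalman_residual_update(
    x=2.0, prior_precision=1.0, innovation=4.0, obs_precision=3.0
)
ffn_prec = propagate_precision(precision=post_prec, jacobian=2.0, process_noise_var=1.0)
```

In this sketch the process-noise term keeps the output precision finite even when the input precision is very large, which is one way to read the "Jacobian-plus-process-noise" wording in the summary.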

📝 Abstract

The Transformer is the foundational building block of modern AI, yet offers no principled handling of uncertainty, which is prevalent in real applications: cold-start tokens with sparse histories in sequential recommendation, heterogeneous signal quality in language models, and attention sinks induced by unconstrained softmax. Every token is treated with uniform confidence. We show this uniformity is a degenerate case of our Bayesian Filtering Transformer (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian-plus-process-noise rule. Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior. BFT replaces any Transformer layer with negligible overhead. On sequential recommendation, BFT applied to three major architectures yields significant gains on six benchmarks, with the largest improvements on cold-start users and rare items where uncertainty is highest. On supervised fine-tuning of large language models with noisy data, BFT improves robustness in two regimes: noisy supervision (token-label corruption in question answering) and noisy context (retrieval-augmented QA with real RAG distractors). A single principled modification, restoring precision, unlocks substantial headroom across both classical sequence-modeling and modern LLM regimes.
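The claim that uniform confidence is a degenerate case can be illustrated with a small sketch. The abstract does not give the paper's attention formula, so the weighting below is an assumption: each key's softmax term is simply multiplied by a per-token precision before normalization. The function names and the example numbers are hypothetical.

```python
import math


def softmax(scores):
    """Plain softmax over a list of attention scores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def precision_weighted_attention(scores, precisions):
    """Attention weights scaled by per-token precision (illustrative assumption).

    weight_j is proportional to precisions[j] * exp(scores[j]). When every
    precision is equal, the common factor cancels in the normalization and the
    result is ordinary softmax attention.
    """
    m = max(scores)
    unnorm = [p * math.exp(s - m) for s, p in zip(scores, precisions)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


scores = [1.0, 2.0, 0.5]

# Uniform precision: recovers plain softmax.
uniform = precision_weighted_attention(scores, [2.0, 2.0, 2.0])

# The second token is treated as unreliable (low precision), so its weight drops.
noisy = precision_weighted_attention(scores, [1.0, 0.1, 1.0])
```

Under this assumed form, a token with a high raw score but low precision contributes less to the output, which matches the intuition of down-weighting unreliable observations in kriging.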

Problem

Research questions and friction points this paper is trying to address.

uncertainty

Transformer

sequential recommendation

noisy data

attention mechanism

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian Filtering Transformer

Kalman Filtering

Kriging