Calibrating Transformers via Sparse Gaussian Processes

πŸ“… 2023-03-04
πŸ›οΈ International Conference on Learning Representations
πŸ“ˆ Citations: 13
✨ Influential: 2
πŸ€– AI Summary
To address the poor calibration of uncertainty estimation in Transformer models for safety-critical applications, this paper proposes Sparse Gaussian Process Attention (SGPA)β€”the first method to embed a sparse Gaussian process directly into the multi-head attention mechanism, enabling scalable Bayesian inference in the attention output space. SGPA replaces the standard scaled dot-product with a valid symmetric kernel function, balancing kernel expressiveness against tractability of the posterior approximation. Empirically, SGPA preserves competitive accuracy on mainstream text, image, and graph prediction tasks while noticeably improving in-distribution predictive calibration, out-of-distribution robustness, and anomaly detection. By unifying deep representation learning with principled Bayesian uncertainty quantification inside the attention module, SGPA offers a path toward more trustworthy Transformer modeling.
πŸ“ Abstract
Transformer models have achieved profound success in prediction tasks in a wide range of applications in natural language processing, speech recognition and computer vision. Extending the Transformer's success to safety-critical domains requires calibrated uncertainty estimation, which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of multi-head attention blocks (MHAs) in Transformers to calibrate their uncertainty. It replaces the scaled dot-product operation with a valid symmetric kernel and uses sparse Gaussian process (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.
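The abstract's core mechanical change can be illustrated with a small sketch: replace the scaled dot-product scores in attention with evaluations of a valid symmetric (PSD) kernel, then normalize. This is a minimal NumPy illustration, not the paper's implementation; the RBF kernel, the simple row normalization, and the names `rbf_kernel`/`kernel_attention` are illustrative assumptions (the paper derives its own specific kernels and a Bayesian treatment of the outputs).

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential (RBF) kernel: a valid symmetric PSD kernel,
    # standing in for the paper's choice. a: (n, d), b: (m, d) -> (n, m).
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-sq / (2.0 * lengthscale**2))

def kernel_attention(Q, K, V, lengthscale=1.0):
    # Replace scaled dot-product scores with kernel evaluations.
    # Kernel values are positive, so a plain row normalization plays
    # the role that softmax plays in standard attention.
    scores = rbf_kernel(Q, K, lengthscale)          # (n_q, n_k)
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ V                               # (n_q, d_v)
```

Because the score matrix comes from a valid kernel, it can double as a GP covariance, which is what lets the paper run sparse-GP posterior inference over the attention outputs.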
Problem

Research questions and friction points this paper is trying to address.

How to obtain calibrated uncertainty estimates from Transformers for safety-critical domains
Whether the scaled dot-product in attention can be replaced by a valid symmetric kernel to enable Bayesian inference
How to improve both calibration and out-of-distribution robustness across text, image, and graph tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Gaussian Process attention (SGPA) for Transformers
Replaces the scaled dot-product with a valid symmetric kernel
Uses SGP techniques for Bayesian inference in the output space of MHAs