AI Summary
To address the poor calibration of uncertainty estimates in Transformer models for safety-critical applications, this paper proposes Sparse Gaussian Process Attention (SGPA), the first method to embed a sparse Gaussian process directly into the multi-head attention mechanism, enabling scalable Bayesian inference in the attention output space. SGPA replaces the standard scaled dot-product with a valid symmetric kernel function, balancing kernel expressiveness against the tractability of posterior approximation. Empirically, SGPA preserves competitive accuracy on mainstream text, image, and graph prediction tasks while noticeably improving in-distribution predictive calibration, out-of-distribution robustness, and anomaly detection performance. By unifying deep representation learning with principled Bayesian uncertainty quantification inside the attention module, SGPA offers a new route to trustworthy Transformer modeling.
Abstract
Transformer models have achieved profound success in prediction tasks across a wide range of applications in natural language processing, speech recognition, and computer vision. Extending the Transformer's success to safety-critical domains requires calibrated uncertainty estimation, which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of the multi-head attention blocks (MHAs) in Transformers to calibrate their uncertainty. It replaces the scaled dot-product operation with a valid symmetric kernel and uses sparse Gaussian process (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images, and graphs, SGPA-based Transformers achieve competitive predictive accuracy while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.
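The abstract's key mechanical change is swapping the scaled dot-product score for a valid symmetric kernel evaluated between queries and keys. The paper's specific kernel choice and its SGP posterior approximation are not spelled out here, so the following is only an illustrative NumPy sketch using an RBF kernel (an assumption) to show where the kernel substitution happens in an attention block:

```python
import numpy as np

def rbf_kernel(Q, K, lengthscale=1.0):
    # A symmetric, positive semi-definite kernel between query and key vectors,
    # standing in for the scaled dot-product scores (illustrative choice only).
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def kernel_attention(Q, K, V, lengthscale=1.0):
    # Kernel evaluations replace QK^T; rows are normalized so each output
    # is still a convex combination of the value vectors.
    scores = rbf_kernel(Q, K, lengthscale)
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 query positions, dim 8
K = rng.normal(size=(7, 8))   # 7 key positions
V = rng.normal(size=(7, 4))   # values, dim 4
out = kernel_attention(Q, K, V)
print(out.shape)  # (5, 4)
```

The actual SGPA method goes further than this sketch: it treats the attention outputs as a sparse GP posterior, which yields predictive variances alongside the point outputs shown above.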