Revisiting Kernel Attention with Correlated Gaussian Process Representation

📅 2025-02-27
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of Transformer-based uncertainty calibration: the restrictive symmetric-kernel assumption of Gaussian processes (GPs) in existing GP-Transformer models. We propose a Correlated Gaussian Process (CGP) attention mechanism. Methodologically, we formulate self-attention as the cross-covariance between two correlated GPs, which need not be symmetric, thereby relaxing the symmetry constraint inherent in conventional GP-Transformers and enhancing representational capacity. To ensure scalability, we further introduce a sparse variational CGP approximation. Empirical evaluation across multiple benchmark tasks demonstrates that our approach consistently outperforms state-of-the-art GP-based Transformers, with substantial improvements in uncertainty calibration, modeling flexibility, and predictive performance.

📝 Abstract
Transformers have increasingly become the de facto method for modeling sequential data, with state-of-the-art performance. Given their widespread use, being able to estimate and calibrate their modeling uncertainty is important for understanding and designing robust transformer models. To achieve this, previous works have used Gaussian processes (GPs) to perform uncertainty calibration for the attention units of transformers and attained notable successes. However, such approaches must confine the transformers to the space of symmetric attention to satisfy the symmetry requirement of their GP kernel specification, which reduces the representation capacity of the model. To mitigate this restriction, we propose the Correlated Gaussian Process Transformer (CGPT), a new class of transformers whose self-attention units are modeled as the cross-covariance between two correlated GPs (CGPs). This allows asymmetry in attention and can enhance the representation capacity of GP-based transformers. We also derive a sparse approximation for CGP to make it scale better. Our empirical studies show that both CGP-based and sparse CGP-based transformers achieve better performance than state-of-the-art GP-based transformers on a variety of benchmark tasks. The code for our experiments is available at https://github.com/MinhLong210/CGP-Transformers.
Problem

Research questions and friction points this paper is trying to address.

Estimating and calibrating uncertainty in transformer models.
Overcoming symmetric attention limitations in GP-based transformers.
Enhancing representation capacity with correlated Gaussian processes.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Correlated Gaussian Process Transformer (CGPT)
Models self-attention as cross-covariance between GPs
Derives sparse approximation for scalable CGP transformers
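The core idea in the bullets above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `cgp_attention`, the linear query/key projections, and the RBF-style cross-kernel are assumptions for illustration only, and the paper's actual CGP construction and sparse variational approximation are not reproduced. The sketch only shows how an attention matrix built from a kernel between two *different* views of the input is naturally asymmetric, unlike a single symmetric GP kernel.

```python
import numpy as np

def cgp_attention(X, Wq, Wk, V, lengthscale=1.0):
    """Toy asymmetric kernel attention (illustrative only).

    Scores use an RBF-style cross-kernel
        k(q_i, k_j) = exp(-||q_i - k_j||^2 / (2 * lengthscale^2)),
    where q_i = Wq @ x_i and k_j = Wk @ x_j are two different
    linear "views" of the input, standing in for samples of two
    correlated GPs. Because Wq != Wk in general, the score matrix
    need not be symmetric.
    """
    Q = X @ Wq                      # query-side view, (n, d)
    K = X @ Wk                      # key-side view,   (n, d)
    # Pairwise squared distances between query and key projections.
    sq = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    scores = np.exp(-sq / (2.0 * lengthscale**2))
    # Row-normalize so each token's attention weights sum to 1.
    A = scores / scores.sum(axis=-1, keepdims=True)
    return A @ V, A
```

With `Wq == Wk` the score matrix collapses back to a symmetric kernel matrix, which is exactly the restriction the CGP formulation is meant to lift.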