Quantum linear algebra is all you need for Transformer architectures

📅 2024-02-26

🏛️ arXiv.org

📈 Citations: 21

✨ Influential: 3

🤖 AI Summary

This work addresses the high computational cost of Transformer inference in large language models (LLMs) by studying Transformer architectures in the setting of fault-tolerant quantum computing. Methodologically, it assumes that pretrained weight matrices are given as block encodings and uses them to construct the query, key, and value matrices; introduces a new subroutine that applies the softmax function row-wise in order to prepare a block encoding of the self-attention matrix; and combines quantum subroutines to build quantum analogues of residual connections, layer normalization, and feed-forward networks. The output is prepared as an amplitude encoding, which can be measured to obtain a prediction. Contributions include: (1) an end-to-end construction of the Transformer's main building blocks from quantum subroutines; (2) a new quantum subroutine for row-wise softmax; and (3) an analysis, based on common open-source LLMs, of the parameters that determine the run time of the quantum algorithm, together with a discussion of the potential and challenges for obtaining a quantum advantage.
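To make the term "block encoding" concrete, the following is a minimal classical sketch, not the paper's construction. It assumes a square matrix `A` and embeds `A / alpha` in the top-left block of a larger unitary using a standard unitary dilation. The function name `block_encode` and the choice of `alpha` as the spectral norm are illustrative assumptions. A dense simulation like this offers no speedup; it only illustrates the definition.

```python
import numpy as np

def block_encode(A):
    """Return (U, alpha) with U unitary and alpha * U[:n, :n] == A.

    Assumption: A is square. alpha is taken to be the spectral norm of A,
    so that A / alpha has singular values in [0, 1].
    """
    alpha = np.linalg.norm(A, 2)
    A_scaled = A / alpha
    # SVD of the scaled matrix: A_scaled = W @ diag(s) @ Vh
    W, s, Vh = np.linalg.svd(A_scaled)
    # c = sqrt(1 - s^2), clipped to guard against tiny negative round-off
    c = np.sqrt(np.clip(1.0 - s**2, 0.0, None))
    top_right = (W * c) @ W.conj().T        # sqrt(I - A A^dagger)
    bottom_left = (Vh.conj().T * c) @ Vh    # sqrt(I - A^dagger A)
    U = np.block([
        [A_scaled, top_right],
        [bottom_left, -A_scaled.conj().T],
    ])
    return U, alpha

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))   # stand-in for a trained weight matrix
U, alpha = block_encode(A)
n = A.shape[0]
```

Here `U` acts on one more qubit than `A` would need on its own, and the original matrix is recovered as `alpha * U[:n, :n]`.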

📝 Abstract

Generative machine learning methods such as large-language models are revolutionizing the creation of text and images. While these models are powerful they also harness a large amount of computational resources. The transformer is a key component in large language models that aims to generate a suitable completion of a given partial sequence. In this work, we investigate transformer architectures under the lens of fault-tolerant quantum computing. The input model is one where trained weight matrices are given as block encodings and we construct the query, key, and value matrices for the transformer. We show how to prepare a block encoding of the self-attention matrix, with a new subroutine for the row-wise application of the softmax function. In addition, we combine quantum subroutines to construct important building blocks in the transformer, the residual connection and layer normalization, and the feed-forward neural network. Our subroutines prepare an amplitude encoding of the transformer output, which can be measured to obtain a prediction. Based on common open-source large-language models, we provide insights into the behavior of important parameters determining the run time of the quantum algorithm. We discuss the potential and challenges for obtaining a quantum advantage.
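As a classical point of reference for the components named in the abstract, the sketch below computes single-head self-attention with a row-wise softmax in NumPy. It assumes the standard scaled dot-product form; the function names, matrix shapes, and random weights are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def row_softmax(scores):
    """Apply softmax independently to each row of a 2-D array."""
    # Subtracting the row maximum avoids overflow and leaves the result unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (classical reference)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[1]
    attn = row_softmax(Q @ K.T / np.sqrt(d))  # each row is a probability vector
    return attn @ V, attn

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, embedding dimension 8 (assumed sizes)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
```

Each row of `attn` sums to one, which is the row-wise normalization that the quantum softmax subroutine has to reproduce inside a block encoding.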

Problem

Research questions and friction points this paper is trying to address.

Accelerating transformer inference using quantum linear algebra

Implementing quantum subroutines for transformer architecture components

Demonstrating quantum speedup potential for large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantum linear algebra accelerates transformer inference

Hadamard product and element-wise functions on quantum computers

Amplitude encoding of transformer output for quantum speedup
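The last item can be illustrated with a small classical simulation of reading out an amplitude-encoded vector. This is a sketch under the usual assumption that a computational-basis measurement returns index `i` with probability proportional to the squared magnitude of entry `i`. The vector `y`, the shot count, and the helper names are hypothetical.

```python
import numpy as np

def amplitude_encode(y):
    """Normalize y to unit Euclidean norm so its entries can act as amplitudes."""
    norm = np.linalg.norm(y)
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return y / norm

def measure(amplitudes, shots, rng):
    """Sample basis-state indices with probability |amplitude|^2."""
    probs = np.abs(amplitudes) ** 2
    probs = probs / probs.sum()  # guard against floating-point drift
    return rng.choice(len(amplitudes), size=shots, p=probs)

rng = np.random.default_rng(0)
y = np.array([0.5, -2.0, 1.0, 0.1])  # hypothetical transformer output vector
state = amplitude_encode(y)
samples = measure(state, shots=1000, rng=rng)
counts = np.bincount(samples, minlength=len(y))
```

Sampling reveals only magnitudes, not signs, and estimating small entries takes many shots. That is one reason the readout step matters when assessing any end-to-end speedup.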

🔎 Similar Papers

Learning with SASQuaTCh: a Novel Variational Quantum Transformer Architecture with Kernel-Based Self-Attention

2024-03-21 | arXiv.org | Citations: 0
