🤖 AI Summary
Deploying quantized LLMs on edge FPGAs faces severe scalability bottlenecks: long prompts cause quadratic growth in prefill computation and a surge in KV-cache bandwidth demand, drastically increasing decoding latency. Static accelerators suffer from attention logic redundancy, low structural utilization, and LUT/URAM bottlenecks due to fixed hardware sharing between the prefill and decoding phases, limiting model scale and context length.
Method: We propose a time-multiplexed architecture leveraging dynamic partial reconfiguration (DPR) to enable phase-specific hardware specialization—reconfiguring the same physical resources for prefill or decoding on-the-fly. Our design integrates roofline modeling, ternary lookup-table-based matrix multiplication, KV-cache-aware decoding streaming, and reconfiguration latency hiding.
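The roofline reasoning behind phase-specific specialization can be sketched as follows. This is a minimal illustration, not the paper's actual model: the peak-throughput and bandwidth numbers are placeholders, and the FLOP/byte counts use a hypothetical 4096-wide layer with a 512-token prompt.

```python
# Hedged sketch: roofline-style classification of prefill vs. decoding.
# All hardware roofs below are illustrative placeholders, not measured values.

def arithmetic_intensity(flops, bytes_moved):
    """Operations per byte of off-chip traffic."""
    return flops / bytes_moved

def attainable_tput(intensity, peak_flops, peak_bw):
    """Roofline: min(compute roof, bandwidth roof * intensity)."""
    return min(peak_flops, peak_bw * intensity)

PEAK_FLOPS = 1e12   # hypothetical 1 TOP/s edge-FPGA compute roof
PEAK_BW = 20e9      # hypothetical 20 GB/s DDR bandwidth roof

D, L = 4096, 512    # hypothetical hidden size and prompt length

# Prefill: a DxD weight matrix is reused across L tokens -> high intensity.
prefill_ai = arithmetic_intensity(flops=2 * D * D * L, bytes_moved=D * D)
# Decoding: one token streams the same weights once -> low intensity.
decode_ai = arithmetic_intensity(flops=2 * D * D, bytes_moved=D * D)

ridge = PEAK_FLOPS / PEAK_BW  # intensity where the two roofs intersect
assert prefill_ai > ridge     # prefill lands on the compute roof
assert decode_ai < ridge      # decoding lands on the bandwidth roof
```

Because the two phases sit on opposite sides of the ridge point, a single static datapath is necessarily over-provisioned for one of them, which motivates swapping in a phase-specialized engine via DPR.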
Contribution/Results: On edge FPGAs, our approach achieves up to 27 tokens/s decoding throughput—1.3×–2.1× higher than SOTA—while incurring zero additional area overhead.
📝 Abstract
Aggressively quantized large language models (LLMs), such as BitNet-style 1.58-bit Transformers with ternary weights, make it feasible to deploy generative AI on low-power edge FPGAs. However, as prompts grow to tens of thousands of tokens, edge hardware performance drops sharply with sequence length due to quadratic prefill cost and rapidly increasing KV-cache bandwidth demands, making long-context inference latency a first-order system concern. Recent studies on LLMs expose a fundamental prefill-decode asymmetry: prefill is compute-bound and dominated by dense matrix-matrix operations, whereas decoding is memory-bandwidth-bound and dominated by KV-cache traffic. A static accelerator must provision resources and a single dataflow for both regimes, leading to duplicated attention logic, underutilized fabric, and tight LUT/URAM limits that cap model size and usable context. We propose a prefill-decode disaggregated LLM accelerator, PD-Swap, that uses Dynamic Partial Reconfiguration (DPR) to time-multiplex the attention module on edge FPGAs. The core table-lookup ternary matrix-multiplication and weight-buffering engines remain static, while the attention subsystem is a reconfigurable partition with two phase-specialized architectures: a compute-heavy, token-parallel prefill engine and a bandwidth-optimized, KV-cache-centric decoding engine. A roofline-inspired model and design space exploration jointly optimize the reconfigurable-region size and parallelism under reconfiguration and routability constraints, while reconfiguration latency is hidden behind computation. PD-Swap achieves up to 27 tokens/s decoding throughput, outperforming prior state-of-the-art works by 1.3×–2.1× (with larger gains at longer context lengths), without extra area cost.
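The table-lookup ternary matrix multiplication at the static core can be illustrated with a small sketch. This is a generic T-MAC-style formulation under assumed details (group size g = 4, base-3 weight packing), not the paper's exact engine: for each group of 4 activations, all 3⁴ = 81 possible signed partial sums are precomputed once, so every packed ternary weight group costs one table lookup instead of four multiplies.

```python
from itertools import product

# Hedged sketch of table-lookup ternary (-1/0/+1) dot products.
# Group size and encoding are illustrative assumptions.

G = 4
# All ternary weight patterns of length G, in base-3 index order.
PATTERNS = list(product((-1, 0, 1), repeat=G))

def pattern_index(w_group):
    """Encode a ternary weight group as a base-3 table index."""
    idx = 0
    for w in w_group:
        idx = idx * 3 + (w + 1)
    return idx

def build_lut(x_group):
    """Precompute sum(w_i * x_i) for every ternary pattern of this group."""
    return [sum(w * x for w, x in zip(p, x_group)) for p in PATTERNS]

def ternary_dot(weights, x):
    """Dot product with ternary weights via per-group table lookups."""
    assert len(weights) == len(x) and len(x) % G == 0
    acc = 0
    for i in range(0, len(x), G):
        lut = build_lut(x[i:i + G])  # amortized across all output rows
        acc += lut[pattern_index(weights[i:i + G])]
    return acc
```

On an FPGA the LUT build is shared across every output row that consumes the same activation group, which is what turns multiplies into cheap lookups; the sketch rebuilds it per call only for clarity.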