PD-Swap: Prefill-Decode Logic Swapping for End-to-End LLM Inference on Edge FPGAs via Dynamic Partial Reconfiguration

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying quantized LLMs on edge FPGAs faces severe scalability bottlenecks: long prompts cause quadratic growth in prefill computation and a surge in KV-cache bandwidth demand, drastically increasing decoding latency. Static accelerators suffer from redundant attention logic, low structural utilization, and LUT/URAM bottlenecks because fixed hardware is shared between the prefill and decoding phases, limiting model scale and context length. Method: We propose a time-multiplexed architecture that leverages dynamic partial reconfiguration (DPR) for phase-specific hardware specialization, reconfiguring the same physical resources on the fly for either prefill or decoding. The design integrates roofline modeling, ternary lookup-table-based matrix multiplication, KV-cache-aware decode streaming, and reconfiguration latency hiding. Contribution/Results: On edge FPGAs, the approach achieves up to 27 tokens/s decoding throughput, 1.3×–2.1× higher than the state of the art, with zero additional area overhead.
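The table-lookup ternary matrix multiplication is the static core of the design. The paper maps it to FPGA LUTs, but the arithmetic trick can be conveyed in a minimal Python sketch: precompute all possible partial sums for a small group of activations once, then every row of the ternary weight matrix fetches its contribution by index instead of multiplying. The group size `G`, the function name, and the NumPy rendering are illustrative assumptions, not the paper's actual parameters.

```python
import itertools
import numpy as np

# Hypothetical group size: how many ternary weights share one lookup index.
G = 4

def ternary_lut_matvec(W, x):
    """Matrix-vector product with ternary weights via table lookup.

    W: (M, K) matrix with entries in {-1, 0, +1}; x: (K,) activations.
    For each group of G activations, precompute the dot product with every
    possible ternary pattern (3**G entries); each row of W then fetches its
    partial sum by index instead of multiplying.
    """
    M, K = W.shape
    assert K % G == 0, "K must be a multiple of the group size"
    # All 3**G ternary patterns, in base-3 order (digit 0 -> -1, 1 -> 0, 2 -> +1).
    patterns = np.array(list(itertools.product([-1, 0, 1], repeat=G)))
    base = 3 ** np.arange(G)[::-1]          # base-3 place values
    y = np.zeros(M)
    for k in range(0, K, G):
        table = patterns @ x[k:k + G]       # 3**G partial sums for this group
        idx = (W[:, k:k + G] + 1) @ base    # each row's pattern index
        y += table[idx]                     # accumulate via lookup, no multiplies
    return y

# Quick check against a dense matmul.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 16))
x = rng.standard_normal(16)
assert np.allclose(ternary_lut_matvec(W, x), W @ x)
```

On an FPGA the `table` computation amortizes across all M output rows, which is why LUT-based lookup replaces multipliers entirely for ternary weights.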

📝 Abstract
Aggressively quantized large language models (LLMs), such as BitNet-style 1.58-bit Transformers with ternary weights, make it feasible to deploy generative AI on low-power edge FPGAs. However, as prompts grow to tens of thousands of tokens, edge hardware performance drops sharply with sequence length due to quadratic prefill cost and rapidly increasing KV-cache bandwidth demands, making long-context inference latency a first-order system concern. Recent studies on LLMs expose a fundamental prefill-decode asymmetry: prefill is compute-bound and dominated by dense matrix-matrix operations, whereas decoding is memory-bandwidth-bound and dominated by KV-cache traffic. A static accelerator must provision resources and a single dataflow for both regimes, leading to duplicated attention logic, underutilized fabric, and tight LUT/URAM limits that cap model size and usable context. We propose a prefill-decode disaggregated LLM accelerator, PD-Swap, that uses Dynamic Partial Reconfiguration (DPR) to time-multiplex the attention module on edge FPGAs. The core table-lookup ternary matrix multiplication and weight-buffering engines remain static, while the attention subsystem is a reconfigurable partition with two phase-specialized architectures: a compute-heavy, token-parallel prefill engine and a bandwidth-optimized, KV-cache-centric decoding engine. A roofline-inspired model and design-space exploration jointly optimize the reconfigurable-region size and parallelism under reconfiguration and routability constraints, while reconfiguration latency is hidden behind computation. PD-Swap achieves up to 27 tokens/s decoding throughput, outperforming prior state-of-the-art work by 1.3×–2.1× (with larger gains at longer context lengths), without extra area cost.
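The two quantitative pieces behind this optimization are small enough to sketch directly: a roofline bound that classifies each phase as compute- or bandwidth-limited, and a hiding condition that checks whether the partial bitstream loads faster than the overlapping phase runs. This is a minimal sketch under assumed numbers; `peak_flops`, `peak_bw`, `icap_bw`, and the bitstream size are illustrative placeholders, not measurements from the paper.

```python
def phase_time(flops: float, bytes_moved: float,
               peak_flops: float, peak_bw: float) -> float:
    """Roofline bound: a phase finishes no faster than whichever resource
    saturates first -- compute (prefill) or memory bandwidth (decode)."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def reconfig_hidden(bitstream_bytes: float, icap_bw: float,
                    overlap_time: float) -> bool:
    """DPR latency is hidden when the partial bitstream finishes loading
    through the configuration port before the running phase completes."""
    return bitstream_bytes / icap_bw <= overlap_time

# Example: hide the decode-engine bitstream load under the prefill tail.
prefill_t = phase_time(flops=2e11, bytes_moved=5e8,
                       peak_flops=1e11, peak_bw=1e10)   # compute-bound: 2.0 s
print(reconfig_hidden(bitstream_bytes=4e6, icap_bw=4e8,
                      overlap_time=prefill_t))          # 0.01 s load -> True
```

Because prefill time grows quadratically with prompt length while the bitstream size is fixed, the hiding condition only gets easier to satisfy at longer contexts.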
Problem

Research questions and friction points this paper is trying to address.

Addresses LLM inference latency on edge FPGAs with long prompts
Overcomes the prefill-decode asymmetry via dynamic hardware reconfiguration
Optimizes resource use to boost throughput without extra area cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Dynamic Partial Reconfiguration to swap attention modules between phases
Implements a compute-heavy prefill engine and a bandwidth-optimized decode engine
Hides reconfiguration latency behind computation to boost throughput (see the schedule sketched after this list)
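Put together, these pieces amount to a simple phase-swap schedule. Below is a hypothetical Python rendering of that control flow; the callables are placeholders standing in for the bitstream loader and the two attention engines, not an API from the paper.

```python
def run_inference(prompt_tokens, max_new_tokens,
                  load_prefill_engine, load_decode_engine,
                  prefill, decode_step):
    """One prompt through the two phase-specialized engines.

    The four callables are placeholders for the partial-bitstream loader
    and the two attention engines; on hardware, load_decode_engine overlaps
    with the prefill tail so the swap costs no visible time.
    """
    load_prefill_engine()              # configure token-parallel prefill attention
    kv_cache = prefill(prompt_tokens)  # compute-bound pass, builds the KV cache
    load_decode_engine()               # swap the partition to the decode engine
    generated = []
    for _ in range(max_new_tokens):    # bandwidth-bound autoregressive loop
        token, kv_cache = decode_step(kv_cache)
        generated.append(token)
    return generated
```

The static ternary-matmul and weight-buffering cores stay resident throughout; only the attention partition is rewritten at the phase boundary.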