QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deformable attention suffers from low hardware utilization due to irregular memory access patterns and low arithmetic intensity. To address this, we propose a holistic algorithm-architecture co-optimization framework: (1) a distance-based out-of-order querying (DOOQ) scheme and a scheduling-aware regional prefetching mechanism to enhance cache locality; (2) fusion of multi-stage operations into a unified compute engine that avoids spilling intermediates; and (3) integration of on-chip tensor retention, GEMM-accelerated processing, and mixed-precision quantization, implemented as a custom RTL-level accelerator. Evaluated on diverse Deformable and Sparse DETR models, our design achieves up to 7.29× higher throughput and 47.3× better energy efficiency versus an RTX 4090, with ≤0.9 AP accuracy degradation. End-to-end performance surpasses state-of-the-art accelerators by 3.26–9.82×.

📝 Abstract
Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer, forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W_m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29× higher throughput and 47.3× better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26–9.82× in throughput and 2.01–6.07× in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within ≤0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality, and locality into utilization, QUILL delivers consistent, end-to-end speedups.
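The idea behind DOOQ can be illustrated with a simple reordering: if queries are processed in spatial-proximity order, consecutive queries sample nearby feature-map regions, so their memory footprints overlap and prefetched data stays hot in cache. The greedy nearest-neighbor heuristic below is our own illustrative assumption, not the paper's exact scheduling rule.

```python
import numpy as np

def dooq_order(ref_points):
    """Greedy nearest-neighbor ordering of query reference points.

    Sketch of Distance-based Out-of-Order Querying (DOOQ): visiting
    queries in spatial-proximity order keeps successive sampling
    windows close together, improving cache locality. The greedy
    tour used here is a stand-in for the paper's scheduler.
    """
    pts = np.asarray(ref_points, dtype=np.float64)
    n = len(pts)
    visited = np.zeros(n, dtype=bool)
    order = [0]               # start from the first query
    visited[0] = True
    for _ in range(n - 1):
        cur = pts[order[-1]]
        d = np.linalg.norm(pts - cur, axis=1)
        d[visited] = np.inf   # exclude already-scheduled queries
        nxt = int(np.argmin(d))
        order.append(nxt)
        visited[nxt] = True
    return order
```

For four reference points forming two distant clusters, the ordering visits each cluster exhaustively before jumping to the next, which is exactly the access pattern a regional prefetcher can anticipate.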
Problem

Research questions and friction points this paper is trying to address.

Irregular memory access patterns in deformable transformers limit hardware efficiency
Low arithmetic intensity and memory-compute imbalance reduce accelerator performance
Intermediate tensor spilling and cache-unfriendly operations hinder energy efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cache-friendly single-pass deformable attention accelerator
Distance-based out-of-order querying with prefetching
Fused multi-operation engine with on-chip tensor retention
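To make the fused-engine idea concrete, the sketch below processes one query's sampling points in a single pass: bilinear interpolation, an online (streaming) Softmax, and weighted aggregation are interleaved so no per-point intermediate tensor is ever materialized. This is a simplified stand-in for the paper's MSDeformAttn engine; the function and argument names are ours, and the real hardware also fuses the output projection.

```python
import numpy as np

def fused_msdeform_point(value, samp_xy, attn_logits):
    """Single-pass sketch of fused deformable attention for one query.

    value:       (H, W, C) feature map
    samp_xy:     (P, 2) fractional sampling locations (x, y)
    attn_logits: (P,) unnormalized attention weights

    Interpolation, Softmax, and aggregation are fused in one loop, so
    no (P, C) intermediate is spilled off-chip; a simplified model of
    the single-pass compute engine described above.
    """
    H, W, C = value.shape
    m, s = -np.inf, 0.0       # streaming-softmax max and normalizer
    acc = np.zeros(C)         # running weighted sum of samples
    for (x, y), logit in zip(samp_xy, attn_logits):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        fx, fy = x - x0, y - y0

        # bilinear sample with corner indices clamped to the map
        def at(yy, xx):
            return value[min(max(yy, 0), H - 1), min(max(xx, 0), W - 1)]
        v = ((1 - fx) * (1 - fy) * at(y0, x0)
             + fx * (1 - fy) * at(y0, x0 + 1)
             + (1 - fx) * fy * at(y0 + 1, x0)
             + fx * fy * at(y0 + 1, x0 + 1))

        # online softmax: rescale running sums when the max grows
        m_new = max(m, logit)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        w = np.exp(logit - m_new)
        s = s * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / s
```

The online-Softmax update is what removes the need to buffer all P sampled vectors before normalizing, which is the property that lets the hardware engine stay single-pass.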