🤖 AI Summary
Transformer attention suffers from O(N²) computational complexity due to dense query–key similarity computations, severely limiting scalability. To address this, we propose CAMformer, the first architecture to map attention computation onto in-memory associative storage operations in the analog voltage domain: a Binary Attention Content Addressable Memory (BA-CAM) uses charge-sharing circuits to enable constant-time similarity search, while two-level hierarchical top-k selection, pipelined parallel execution, and high-fidelity contextual recovery circuitry jointly optimize accuracy and hardware efficiency. Evaluated on BERT and ViT workloads, CAMformer achieves >10× energy efficiency, up to 4× higher throughput, and 6–8× smaller area versus digital baselines, while preserving near-lossless accuracy. Our core contribution is a paradigm shift to in-memory computing for attention, fundamentally circumventing the computational bottlenecks of conventional digital implementations.
📝 Abstract
Transformers face scalability challenges due to the quadratic cost of attention, which requires dense similarity computations between queries and keys. We propose CAMformer, a novel accelerator that reinterprets attention as an associative memory operation and computes attention scores using a voltage-domain Binary Attention Content Addressable Memory (BA-CAM). This enables constant-time similarity search through analog charge sharing, replacing digital arithmetic with physical similarity sensing. CAMformer integrates hierarchical two-stage top-k filtering, pipelined execution, and high-precision contextualization to achieve both algorithmic accuracy and architectural efficiency. Evaluated on BERT and Vision Transformer workloads, CAMformer achieves over 10× energy efficiency, up to 4× higher throughput, and 6–8× lower area compared to state-of-the-art accelerators, while maintaining near-lossless accuracy.
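The pipeline the abstract describes can be sketched functionally: binarized queries are matched against binarized keys (the bit-agreement count is the digital analogue of what the BA-CAM senses in one shot via charge sharing), a top-k filter keeps only the best-matching keys, and contextualization over the selected values runs at full precision. This is a minimal NumPy sketch under assumed details; the sign binarization, the exact match metric, and the softmax over match counts are illustrative stand-ins, not the paper's circuit-level method.

```python
import numpy as np

def binary_attention_topk(Q, K, V, k):
    """Digital emulation of the CAMformer-style attention pipeline.

    Q: (nq, d) queries, K: (n, d) keys, V: (n, dv) values, k: keys kept per query.
    Binarization and scoring details are assumptions for illustration.
    """
    # Sign binarization of queries and keys (assumed encoding).
    Qb = (Q > 0).astype(np.int32)
    Kb = (K > 0).astype(np.int32)

    # Bit-agreement count per (query, key) pair: ones that match plus zeros
    # that match. A BA-CAM row would produce this similarity in constant time
    # via analog charge sharing; here it is an explicit matrix product.
    matches = Qb @ Kb.T + (1 - Qb) @ (1 - Kb).T  # shape (nq, n)

    # Top-k filtering: retain only the k most similar keys for each query.
    idx = np.argpartition(-matches, k - 1, axis=1)[:, :k]

    out = np.zeros((Q.shape[0], V.shape[1]))
    for i in range(Q.shape[0]):
        sel = idx[i]
        scores = matches[i, sel].astype(np.float64)
        # High-precision contextualization over the selected values only.
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ V[sel]
    return out
```

Because only k of the N keys survive the filter, the expensive full-precision weighting touches a small, fixed-size working set per query, which is the algorithmic source of the throughput and energy savings claimed above.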