🤖 AI Summary
Existing efficient Transformer variants (such as sparse, sliding-window, and linear attention methods) typically trade off in-context recall for computational or memory efficiency, fixing a static quality-efficiency compromise in advance that hinders adaptation to diverse downstream tasks. Their designs rely on heuristic constraints, hand-crafted state-update rules, or hybrid architectures, limiting flexibility and structural simplicity.
Method: We propose the Compress&Attend Transformer (CAT), the first unified architecture that learns *training-time* multi-granularity sequence compression jointly with dense attention, and supports *inference-time*, zero-shot dynamic adjustment of compression granularity and chunk size without retraining.
Contribution/Results: Experiments show that a single CAT model matches full-attention baselines in language modeling while accelerating inference by 1.4–3× and reducing memory consumption by 2–9×—significantly outperforming established efficient Transformer baselines across quality-efficiency trade-offs.
📝 Abstract
The quadratic cost of attention in transformers has motivated the development of efficient approaches, namely sparse and sliding-window attention, convolutions, and linear attention. Although these approaches yield impressive reductions in compute and memory, they often trade off quality, specifically in-context recall performance. Moreover, fixing this quality-compute trade-off a priori means being suboptimal from the start: some downstream applications require more memory for in-context recall, while others demand lower latency and memory. Further, these approaches rely on heuristic choices that artificially restrict attention, or require hand-crafted and complex recurrent state-update rules, or must be carefully composed with attention at specific layers to form hybrid architectures that complicate the design process, especially at scale. To address these issues, we propose the Compress&Attend Transformer (CAT), a conceptually simple architecture employing only two ingredients: dense attention and compression. CAT decodes chunks of tokens by attending to compressed chunks of the sequence so far. Compression lets the model decode from a reduced sequence length, yielding compute and memory savings, while the choice of chunk size trades off quality for efficiency. Moreover, CAT can be trained with multiple chunk sizes at once, unlocking control of quality-compute trade-offs directly at test time without any retraining, all in a single adaptive architecture. In exhaustive evaluations on common language modeling tasks, in-context recall, and long-context understanding, a single adaptive CAT model outperforms existing efficient baselines, including hybrid architectures, across different compute-memory budgets. Further, a single CAT matches a dense transformer in language modeling across model scales while being 1.4-3x faster and requiring 2-9x less total memory.
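To make the decode-from-compressed-chunks idea concrete, here is a minimal NumPy sketch. It is an illustration only, not the paper's implementation: mean-pooling stands in for CAT's learned compression, keys double as values, and all function names (`compress_chunks`, `cat_decode_step`) are hypothetical. The point it demonstrates is the cost structure: attention for the current chunk runs over `len(context) / chunk_size` compressed vectors rather than the full context, so a larger chunk size buys efficiency at the cost of fidelity.

```python
import numpy as np

def compress_chunks(context, chunk_size):
    """Compress each chunk of token embeddings into one vector.

    Mean-pooling is a stand-in here; in CAT the compression is learned.
    """
    n = (len(context) // chunk_size) * chunk_size  # drop a ragged tail, if any
    chunks = context[:n].reshape(-1, chunk_size, context.shape[-1])
    return chunks.mean(axis=1)  # (num_chunks, d)

def attend(queries, keys_values):
    """Standard dense softmax attention (keys == values for brevity)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def cat_decode_step(context, current_chunk, chunk_size):
    """Decode the current chunk against the *compressed* sequence so far.

    Attention cost scales with len(context) / chunk_size, not len(context),
    which is the source of CAT's compute and memory savings.
    """
    compressed = compress_chunks(context, chunk_size)
    return attend(current_chunk, compressed)
```

Because `chunk_size` is just an argument to the decode step, the same model state can be queried at different granularities, which mirrors how a CAT trained with multiple chunk sizes exposes the quality-efficiency knob at test time.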