🤖 AI Summary
Existing log-linear attention mechanisms use a fixed memory decay parameter λ that does not adapt to input content, which limits their capacity for modeling long sequences. This work proposes content-aware adaptive memory decay: a lightweight two-layer MLP with a softplus activation generates a decay parameter for each token and each Fenwick tree level. The method preserves O(n log n) computational complexity and avoids the inter-level competition induced by softmax normalization. To the best of our knowledge, this is the first mechanism to make memory decay in log-linear attention input-dependent. It consistently outperforms the fixed-decay baseline on associative recall, selective copying, and language modeling, with the largest gains in long-sequence settings where fixed decay parameters degrade performance.
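To make the O(n log n) figure concrete, the following is a minimal sketch of a generic Fenwick-style decomposition: the prefix before position t splits into at most ⌊log₂ t⌋ + 1 disjoint segments of power-of-two size, so attending to one summary per segment costs O(log n) per token. The function name `fenwick_segments` is hypothetical, and the exact level indexing used in the paper may differ from this illustration.

```python
def fenwick_segments(t):
    """Split the prefix [0, t) into disjoint power-of-two segments, newest first.

    Generic Fenwick-tree decomposition; the paper's own level indexing
    is not given in this summary and may differ (assumption).
    """
    segments = []
    end = t
    while end > 0:
        size = end & -end              # lowest set bit of `end`
        segments.append((end - size, end))
        end -= size                    # peel that segment off the prefix
    return segments


# Number of segments per position never exceeds the bit length of t,
# i.e. floor(log2(t)) + 1, which is where the log factor comes from.
segment_counts = [len(fenwick_segments(t)) for t in range(1, 1025)]
```

Summing these per-position counts over a sequence of length n gives a total that grows as O(n log n), in line with the complexity stated above.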
📝 Abstract
Sequence models face a fundamental tradeoff between memory capacity and computational efficiency. Transformers achieve expressive context modeling at quadratic cost, while linear attention and state-space models run in linear time by compressing context into a fixed-size hidden state, which inherently limits recall. Log-linear attention navigates this tradeoff by organizing memory across a Fenwick tree hierarchy, growing its hidden state logarithmically with sequence length at log-linear compute cost. However, its memory decay parameter λ is fixed and independent of the input, so every hierarchy level receives the same weight regardless of content, which introduces unnecessary rigidity. We propose learning λ directly from the input via a lightweight two-layer MLP, producing per-token, per-level decay that adapts to content rather than position. A softplus activation lets each Fenwick tree level scale independently, avoiding the inter-level competition that softmax introduces. This modification preserves log-linear complexity exactly and adds negligible parameter overhead. We evaluate on associative recall, selective copying, and language modeling, finding that input-dependent decay consistently outperforms the fixed-λ baseline, with the largest gains in long-range memory settings where the baseline degrades or collapses entirely.
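The per-token, per-level decay described above can be illustrated with a small sketch. Everything beyond "two-layer MLP with a softplus output" is an assumption made for illustration: the hidden width, the ReLU hidden activation, the initialization, and the names `init_linear`, `decay_logits`, and `adaptive_decay` are all hypothetical and are not taken from the paper.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(rng, n_in, n_out):
    # Hypothetical uniform init; the paper's scheme is not stated here.
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def decay_logits(x, params):
    (w1, b1), (w2, b2) = params
    hidden = [max(h, 0.0) for h in linear(x, w1, b1)]  # ReLU: an assumption
    return linear(hidden, w2, b2)                      # one logit per level


def adaptive_decay(x, params):
    # Softplus is applied elementwise, so each level is scaled on its own.
    return [softplus(z) for z in decay_logits(x, params)]


rng = random.Random(0)
d_model, d_hidden, num_levels = 8, 16, 5               # illustrative sizes
params = (init_linear(rng, d_model, d_hidden),
          init_linear(rng, d_hidden, num_levels))
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
lam = adaptive_decay(token, params)                    # one λ per level

# Raising one level's logit leaves the other levels untouched under softplus,
# but shrinks them under softmax because the outputs must sum to 1.
logits = [0.5, -1.0, 2.0]
bumped = [0.5, -1.0, 4.0]
softplus_before = [softplus(z) for z in logits]
softplus_after = [softplus(z) for z in bumped]
softmax_before = softmax(logits)
softmax_after = softmax(bumped)
```

Because the extra cost is one small MLP evaluation per token producing O(log n) outputs, a design along these lines would leave the overall log-linear complexity unchanged.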