Expansion Span: Combining Fading Memory and Retrieval in Hybrid State Space Models

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing hybrid state space models (SSMs) combine SSM layers with attention but retain a fixed attention window, preventing efficient access to distant historical tokens. To address this, we propose Span-Expanded Attention (SE-Attn), which dynamically retrieves semantically relevant historical tokens and injects them into an expanded attention span, allocating state memory by semantic similarity rather than temporal proximity. We reserve a fraction of the attention context (the "expansion span") for these retrieved tokens and introduce HyLoRA, a fine-tuning method that adapts Low-Rank Adaptation (LoRA) to hybrid SSM-attention architectures. Our approach enables efficient fine-tuning on sequences up to eight times longer than the pre-training context length. On long-context benchmarks, including PG-19 and RULER, it outperforms LongLoRA at lower computational cost.

📝 Abstract
The "state" of State Space Models (SSMs) represents their memory, which fades exponentially over an unbounded span. By contrast, Attention-based models have "eidetic" (i.e., verbatim, or photographic) memory over a finite span (context size). Hybrid architectures combine State Space layers with Attention, but still cannot recall the distant past and can access only the most recent tokens eidetically. Unlike current methods of combining SSM and Attention layers, we allow the state to be allocated based on relevancy rather than recency. In this way, for every new set of query tokens, our models can "eidetically" access tokens from beyond the Attention span of current Hybrid SSMs without requiring extra hardware resources. We describe a method to expand the memory span of the hybrid state by "reserving" a fraction of the Attention context for tokens retrieved from arbitrarily distant in the past, thus expanding the eidetic memory span of the overall state. We call this reserved fraction of tokens the "expansion span," and the mechanism to retrieve and aggregate it "Span-Expanded Attention" (SE-Attn). To adapt Hybrid models to using SE-Attn, we propose a novel fine-tuning method that extends LoRA to Hybrid models (HyLoRA) and allows efficient adaptation on long spans of tokens. We show that SE-Attn enables us to efficiently adapt pre-trained Hybrid models on sequences of tokens up to 8 times longer than the ones used for pre-training. We show that HyLoRA with SE-Attn is cheaper and more performant than alternatives like LongLoRA when applied to Hybrid models on natural language benchmarks with long-range dependencies, such as PG-19, RULER, and other common natural language downstream tasks.
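The retrieve-then-attend idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation; the function name, the similarity-based scoring rule, and the tensor shapes are all assumptions made for illustration. The point it demonstrates: a few slots of the attention context (the "expansion span") are filled with distant tokens chosen by relevance to the current queries, rather than by recency.

```python
import numpy as np

def se_attn_context(history, window, queries, span=4):
    """Toy illustration of an 'expansion span': reserve `span` slots of
    the attention context for past tokens retrieved by similarity,
    instead of attending only to the most recent window.
    history: (H, d) distant-past token embeddings
    window:  (W, d) most recent tokens (the usual attention window)
    queries: (Q, d) current query tokens
    """
    # Score each distant token by its best similarity to any current query.
    sims = queries @ history.T            # (Q, H)
    scores = sims.max(axis=0)             # (H,)
    top = np.argsort(scores)[-span:]      # indices of the most relevant tokens
    retrieved = history[np.sort(top)]     # keep them in chronological order
    # Expanded context = retrieved distant tokens + recent window.
    return np.concatenate([retrieved, window], axis=0)
```

A query similar to an old token will pull that token back into the context, even if it lies far outside the recent window; the actual method additionally handles aggregation inside the hybrid SSM state, which this sketch omits.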
Problem

Research questions and friction points this paper is trying to address.

Hybrid SSMs cannot eidetically recall tokens beyond their fixed Attention span
SSM state memory fades over time, losing verbatim access to the distant past
Extending context typically demands extra hardware or costly fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expands memory span with relevancy-based token allocation
Introduces Span-Expanded Attention for distant token retrieval
Proposes HyLoRA for efficient long-span fine-tuning
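HyLoRA's specifics are not given on this page; as background, the LoRA reparameterization it extends keeps the pre-trained weight frozen and trains only a low-rank additive update. A minimal sketch, with all names, shapes, and the zero-initialization convention assumed for illustration:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA-style forward pass: the frozen weight W is augmented with a
    trainable low-rank update B @ A, so only A and B are fine-tuned.
    x: (n, d_in), W: (d_out, d_in), A: (r, d_in), B: (d_out, r)
    """
    return x @ (W + alpha * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
A = rng.normal(size=(r, d_in))       # trainable low-rank factor
B = np.zeros((d_out, r))             # zero-init: the update starts as a no-op
x = rng.normal(size=(3, d_in))
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)  # B = 0 recovers the base model
```

With rank r much smaller than d_in and d_out, the trainable parameter count drops from d_out * d_in to r * (d_in + d_out), which is what makes long-span adaptation affordable.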
Elvis Nunez
Applied Scientist, AWS
computer vision · machine learning · optimization
L. Zancato
AWS AI Labs
Benjamin Bowman
AWS AI Labs
Aditya Golatkar
AWS AI Labs
Wei Xia
AWS AI Labs
S. Soatto
AWS AI Labs