🤖 AI Summary
Structured State Space Models (SSMs) face an inherent trade-off: recency bias impairs long-range dependency modeling, while depth scaling exacerbates oversmoothing in state transition matrices, degrading representational discriminability. This work is the first to systematically identify and quantify the oversmoothing phenomenon in deep SSMs. We propose a dual-channel polarized state transition matrix design that jointly regulates dynamics in both frequency and time domains, simultaneously mitigating recency bias and oversmoothing. Through rigorous theoretical analysis, controlled depth-scaling ablation studies, and long-range associative recall evaluation, we demonstrate that our approach significantly improves deep SSMs’ accuracy in recalling distant tokens (+12.7%), enhances depth scalability, and strengthens robustness in long-range modeling. The implementation is publicly available.
📝 Abstract
Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective in capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by strong recency bias. Our empirical studies also reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments further show that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, i.e., token representations becoming increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, thereby simultaneously addressing recency bias and over-smoothing. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and enables SSMs to benefit further from deeper architectures. All source code is released at https://github.com/VITA-Group/SSM-Bottleneck.
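To make the polarization idea concrete, here is a minimal NumPy sketch of a simplified diagonal SSM recurrence with two channels of the state transition vector pinned to zero and one. This is an illustrative toy, not the paper's actual implementation (see the linked repository for that); the function name, the element-wise recurrence $h_t = a \odot h_{t-1} + B x_t$, and the channel indices are assumptions chosen for clarity.

```python
import numpy as np

def polarized_ssm_scan(x, a, B, C, zero_ch=0, one_ch=1):
    """Toy diagonal SSM scan: h_t = a * h_{t-1} + B x_t, y_t = C h_t.

    Two channels of the transition vector `a` are polarized:
      - channel `zero_ch` gets a = 0 (no memory: purely local response,
        counteracting over-smoothing across depth)
      - channel `one_ch` gets a = 1 (perfect memory: uniform attention
        over the past, counteracting recency bias)
    """
    a = a.copy()
    a[zero_ch] = 0.0  # polarized "zero" channel: forgets instantly
    a[one_ch] = 1.0   # polarized "one" channel: accumulates everything
    T = x.shape[0]
    h = np.zeros(a.shape[0])
    ys = []
    for t in range(T):
        h = a * h + B @ x[t]  # element-wise (diagonal) state transition
        ys.append(C @ h)
    return np.stack(ys)
```

With identity `B` and `C` and a constant input, the zero channel outputs a constant (it only sees the current token), while the one channel grows with the running sum of the input, illustrating the two extremes of the memory spectrum that the polarization occupies.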