🤖 AI Summary
Existing dense video captioning methods struggle with long videos due to high computational complexity, memory constraints, and reliance on full-video inputs, which precludes online streaming processing. To address this, we propose a temporally extended State-Space Model (SSM) with a state propagation mechanism that overcomes the state decay and memory bottlenecks conventional SSMs suffer on ultra-long sequences. This is the first work to enable efficient online modeling of extremely long videos with SSMs. Leveraging the intrinsic recurrent structure and linear-time complexity of SSMs, our model supports real-time, frame- or segment-wise event segmentation and caption generation. On dense video captioning benchmarks it uses 7× fewer FLOPs, improving efficiency and reducing latency for long-video processing. Our approach establishes a novel paradigm for streaming video understanding.
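To make the mechanism concrete, below is a minimal sketch of a linear SSM scanned over video chunks, where the final hidden state of one chunk is handed to the next instead of being reset, the basic state-propagation idea the summary describes. The diagonal state matrix, the dimensions, and all names (`ssm_chunk_scan`, `A_diag`, and so on) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ssm_chunk_scan(x, A_diag, B, C, h0):
    """Scan a diagonal linear SSM over one chunk of frame features.

    Per-step recurrence:
        h_t = A * h_{t-1} + B @ x_t   # state update (A diagonal)
        y_t = C @ h_t                 # output projection

    Returns the chunk outputs and the final state, which the caller
    passes to the next chunk instead of resetting it.
    """
    h = h0
    ys = []
    for x_t in x:                  # x has shape (chunk_len, d_in)
        h = A_diag * h + B @ x_t   # elementwise decay + input injection
        ys.append(C @ h)
    return np.stack(ys), h

# Toy sizes and random parameters, purely for illustration.
d_in, d_state, d_out, chunk_len = 64, 16, 32, 100
rng = np.random.default_rng(0)
A_diag = np.full(d_state, 0.99)    # decay near 1 retains memory longer
B = 0.1 * rng.normal(size=(d_state, d_in))
C = 0.1 * rng.normal(size=(d_out, d_state))

# Stream an arbitrarily long video chunk by chunk: only the current
# chunk and a single state vector are ever in memory.
h = np.zeros(d_state)
for _ in range(50):                # 50 chunks stand in for a long video
    frames = rng.normal(size=(chunk_len, d_in))
    y, h = ssm_chunk_scan(frames, A_diag, B, C, h)  # h carries over
```

Because each chunk only needs the previous chunk's final state, memory stays constant however long the stream runs; the difficulty the paper targets is keeping that state informative over very long horizons, since a decay factor below 1 erodes it geometrically.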
📝 Abstract
Dense video captioning is a challenging video understanding task that aims to simultaneously segment a video into a sequence of meaningful, consecutive events and to generate detailed captions that accurately describe each event. Existing methods often have difficulty with the long videos typical of dense video captioning, due to computational complexity and memory limitations. Furthermore, traditional approaches require the entire video as input before producing any output, which precludes online processing. We address these challenges by time-scaling State-Space Models (SSMs) to even longer sequences than before. Our approach, State-Space Models with Transfer State, combines the long-sequence and recurrent properties of SSMs and addresses their main limitation, namely that SSMs otherwise cannot sustain their state over very long contexts, effectively scaling SSMs further in time. The proposed model is particularly suited to generating captions on-the-fly, in an online or streaming manner, without waiting for the full video to be processed, which is often more useful in practice. When applied to dense video captioning, our approach scales well with video length and uses 7× fewer FLOPs.
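The abstract does not give implementation details, but the recurrent form implies a simple streaming interface: consume frames one at a time, update a fixed-size state, and emit a caption whenever an event closes. The sketch below illustrates that control flow with toy stand-ins; `encode_frame`, `boundary_score`, and `generate_caption` are hypothetical placeholders for the model's SSM encoder, event-boundary head, and caption decoder, none of which are specified in the source.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_frame(frame, state):
    # Toy stand-in for the SSM encoder: exponential moving average.
    return 0.95 * state + 0.05 * frame

def boundary_score(state):
    # Toy stand-in for a learned event-boundary head.
    return float(np.abs(state).mean())

def generate_caption(state):
    # Toy stand-in for the caption decoder.
    return f"event summary (state norm {np.linalg.norm(state):.2f})"

def stream_captions(frames, d_state=16, threshold=0.12):
    """Yield (start_frame, end_frame, caption) as soon as each event
    closes; per-frame work and memory are constant, so the loop never
    needs the full video."""
    state = np.zeros(d_state)
    event_start = 0
    for t, frame in enumerate(frames):
        state = encode_frame(frame, state)     # O(1) update per frame
        if boundary_score(state) > threshold:  # event boundary detected
            yield event_start, t, generate_caption(state)
            event_start, state = t + 1, np.zeros_like(state)

for start, end, caption in stream_captions(rng.normal(size=(300, 16))):
    print(f"frames {start}-{end}: {caption}")
```

The point of this pattern is that latency to the first caption depends only on the length of the first event, not on the length of the video, which is what makes online, streaming use practical.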