🤖 AI Summary
Addressing the challenges of modeling ultra-long contexts (up to millions of tokens) and the lack of efficient architectures for joint video-language understanding, this paper introduces Blockwise RingAttention, a memory-efficient attention mechanism that enables scalable training and inference over extremely long sequences. The method combines progressive blockwise context expansion (from 4K to 1M tokens), large-scale cleaning and synthesis of long-sequence data, and video-text alignment pretraining. Leveraging these techniques, the authors present the first open-source multimodal foundation model supporting million-token contexts, at 7B parameters. The model achieves state-of-the-art performance on long-context retrieval tasks and processes over one million tokens in a single forward pass, roughly 1.5 hours of 4K video plus associated text. The authors publicly release an extensible training framework and model weights across multiple scales, establishing a new paradigm for long-horizon multimodal sequence modeling.
📝 Abstract
Enabling long-context understanding remains a key challenge in scaling existing sequence models -- a crucial component in developing generally intelligent models that can process and operate over long temporal horizons potentially consisting of millions of tokens. In this paper, we address this challenge by providing a comprehensive exploration of the full development process for producing 1M-token-context language models and video-language models, setting new benchmarks in language retrieval and new capabilities in long video understanding. We detail our long-context data curation process and our progressive context extension from 4K to 1M tokens, and present an efficient open-source implementation for scalable training on long sequences. Additionally, we open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.
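The core idea behind blockwise attention, which RingAttention builds on, is that the softmax over a long sequence can be computed incrementally over key/value blocks using a running max and normalizer, so memory stays constant in the block size rather than growing with sequence length. The sketch below is an illustrative single-device NumPy implementation of this online-softmax trick (it omits the ring-style communication of KV blocks across devices and is not the paper's JAX code; function names are our own):

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: materializes the full (n_q, n_kv) score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def blockwise_attention(q, k, v, block=4):
    """Same result, but K/V are consumed one block at a time.

    Maintains a running row-max `m`, running normalizer `l`, and an
    unnormalized output accumulator `o` (the online-softmax recurrence).
    """
    d = q.shape[-1]
    m = np.full((q.shape[0], 1), -np.inf)   # running max of scores per query
    l = np.zeros((q.shape[0], 1))           # running softmax denominator
    o = np.zeros_like(q, dtype=float)       # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)           # rescale previous partial sums
        l = l * scale + p.sum(axis=-1, keepdims=True)
        o = o * scale + p @ vb
        m = m_new
    return o / l
```

Because each iteration only touches one KV block, peak memory is O(n_q * block) instead of O(n_q * n_kv); in RingAttention the same recurrence runs per device while KV blocks rotate around a ring of hosts, which is what makes million-token contexts feasible.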