🤖 AI Summary
Standard Transformers struggle to efficiently process ultra-long sequences due to their quadratic computational complexity and ever-growing key-value (KV) cache. This work proposes a plug-and-play Collaborative Memory Transformer architecture that processes sequences in chunks and dynamically generates soft prompts via a dual-memory mechanism, comprising a FIFO temporary queue and a gated global memory. This design maintains constant memory usage and linear time complexity, overcoming the length limitations of conventional attention. The approach pairs the collaborative dual-memory scheme with a layer-level pipelined fine-tuning strategy, enabling accurate retrieval of information from arbitrary positions within million-token sequences after fine-tuning on only 32k-context data. It matches the performance of full-attention baselines on SCROLLS summarization tasks and demonstrates strong effectiveness on real-world user-behavior question-answering tasks.
📝 Abstract
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory Transformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient plug-in module, CoMeT can be integrated into pre-trained models with only minimal fine-tuning. It operates on sequential data chunks, using a dual-memory system to manage context: a temporary memory built on a FIFO queue for recent events, and a global memory with a gated update rule for long-range dependencies. These memories then act as a dynamic soft prompt for the next chunk. To enable efficient fine-tuning on extremely long contexts, we introduce a novel layer-level pipeline parallelism strategy. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine-tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M-token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full-attention baseline on summarization tasks. Its practical effectiveness is further validated on real-world agent and user-behavior QA tasks. The code is available at: https://anonymous.4open.science/r/comet-B00B/
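The chunk-wise dual-memory update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the summary vectors, queue length, and sigmoid gate parameterization (`W_g`) are all assumptions standing in for the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, queue_len = 8, 4  # assumed dimensions for illustration

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DualMemory:
    """Hypothetical sketch of CoMeT-style dual memory: a FIFO temporary
    queue for recent chunks plus a gated global memory for long-range
    dependencies. Both together form the soft prompt for the next chunk."""

    def __init__(self):
        self.temp = []                          # FIFO queue of recent chunk summaries
        self.global_mem = np.zeros(d_model)     # gated long-range memory state
        self.W_g = rng.standard_normal((d_model, d_model)) * 0.1  # assumed gate weights

    def update(self, chunk_summary):
        # Temporary memory: push the new summary, evict the oldest when full.
        self.temp.append(chunk_summary)
        if len(self.temp) > queue_len:
            self.temp.pop(0)
        # Global memory: gate g in (0,1) blends the old state with the new summary.
        g = sigmoid(self.W_g @ chunk_summary)
        self.global_mem = g * self.global_mem + (1 - g) * chunk_summary

    def soft_prompt(self):
        # Stack both memories into the prompt prepended to the next chunk.
        return np.stack(self.temp + [self.global_mem])

mem = DualMemory()
for _ in range(6):                 # process six chunks of a long sequence
    mem.update(rng.standard_normal(d_model))

print(mem.soft_prompt().shape)     # (queue_len + 1, d_model) once the queue is full
```

Because the queue is bounded and the global memory is a single fixed-size state, memory usage stays constant regardless of sequence length, which is the property the abstract claims.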