ScaleFormer: Span Representation Cumulation for Long-Context Transformer

๐Ÿ“… 2025-11-13
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
The quadratic complexity of standard self-attention limits Transformer applicability to long-text tasks, while existing efficient variants typically require architectural modifications and full pretraining. This paper proposes a plug-and-play framework that adapts off-the-shelf encoder-decoder models to ultra-long sequences without altering their architecture or updating pretrained weights. Our core innovation is a parameter-free, cross-chunk cumulative fusion mechanism: by processing overlapping input segments and compressing contextual representations, it enhances structural awareness and narrative coherence across segment boundaries, all at linear computational complexity. Experiments on long-document summarization demonstrate that our method achieves or surpasses state-of-the-art performance without auxiliary retrieval modules or from-scratch training. It significantly improves inference efficiency and enables scalable context extension, offering a practical solution for deploying pretrained Transformers on extended-length inputs.

๐Ÿ“ Abstract
The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer), a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk's representation with structural awareness of its position within the document. It achieves this by enriching each chunk's boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document's narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
Problem

Research questions and friction points this paper is trying to address.

How to reduce the quadratic complexity of self-attention on long sequences
How to enable pre-trained models to handle long contexts without architectural changes or retraining
How to generate compressed representations that preserve each chunk's structural position in the document
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segments long inputs into overlapping chunks
Generates compressed context-aware representations for decoder
Uses parameter-free fusion with cumulative context vectors
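The bullets above can be sketched in code. This is a minimal NumPy illustration of one plausible reading of the abstract, not the paper's actual implementation: chunk sizes, the averaging rule, and the choice to fuse only the first and last position of each chunk are all assumptions made for the sketch.

```python
import numpy as np

def chunk_with_overlap(tokens, chunk_len=512, overlap=64):
    """Split a token sequence into overlapping chunks (sizes are hypothetical)."""
    step = chunk_len - overlap
    return [tokens[i:i + chunk_len] for i in range(0, max(len(tokens) - overlap, 1), step)]

def cumulative_fusion(chunk_reprs):
    """Parameter-free cumulative fusion over per-chunk encoder states.

    chunk_reprs: list of (chunk_len, d) arrays of encoder hidden states.
    Each chunk's first boundary vector is averaged with the mean summary of
    all preceding chunks, and its last boundary vector with the mean summary
    of all succeeding chunks, giving every chunk a signal of its position in
    the document. Cumulative sums keep the whole pass linear in chunk count.
    """
    means = np.stack([r.mean(axis=0) for r in chunk_reprs])   # (n_chunks, d) chunk summaries
    prefix = np.cumsum(means, axis=0)                         # prefix[i] = sum of means[:i+1]
    suffix = np.cumsum(means[::-1], axis=0)[::-1]             # suffix[i] = sum of means[i:]
    n = len(chunk_reprs)
    fused = []
    for i, r in enumerate(chunk_reprs):
        r = r.copy()
        if i > 0:                          # context accumulated from preceding chunks
            r[0] = 0.5 * (r[0] + prefix[i - 1] / i)
        if i < n - 1:                      # context accumulated from succeeding chunks
            r[-1] = 0.5 * (r[-1] + suffix[i + 1] / (n - 1 - i))
        fused.append(r)
    return fused
```

The fused chunk representations would then be concatenated (or otherwise compressed) and fed to the unmodified decoder via cross-attention; no new parameters are introduced, so pretrained weights stay untouched.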
๐Ÿ”Ž Similar Papers
No similar papers found.