End-to-End Long Document Summarization using Gradient Caching

📅 2025-01-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address input truncation caused by memory constraints in long-document summarization, this paper proposes CachED, a mechanism enabling end-to-end training of Transformer encoder-decoder models on ultra-long documents (more than 500K tokens). CachED combines three techniques: non-overlapping sliding-window chunked encoding, gradient caching at the decoder, and recomputation of encoder hidden states, all without introducing any additional parameters. This design removes the train-test mismatch inherent in conventional truncation-based approaches. Built on an extended BART architecture, CachED achieves state-of-the-art performance across multiple long-document summarization benchmarks, significantly outperforming truncation baselines, while maintaining architectural simplicity and computational efficiency.

📝 Abstract
Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training. Several approaches have been proposed to extend the input length at test time, but training with these approaches is still difficult, requiring truncation of input documents and causing a mismatch between training and test conditions. In this work, we propose CachED (Gradient **Cach**ing for **E**ncoder-**D**ecoder models), an approach that enables end-to-end training of existing transformer-based encoder-decoder models, using the entire document without truncation. Specifically, we apply non-overlapping sliding windows to input documents, followed by fusion in the decoder. During backpropagation, the gradients are cached at the decoder and are passed through the encoder in chunks by re-computing the hidden vectors, similar to gradient checkpointing. In experiments on long document summarization, we extend BART to CachED BART, processing more than 500K tokens during training and achieving superior performance without using any additional parameters.
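The mechanism described in the abstract (encode chunks independently, fuse in the decoder, then backpropagate by caching the decoder's gradients with respect to each chunk's hidden state and re-running the encoder per chunk) can be illustrated with a toy scalar model. This is a minimal sketch under assumed toy definitions, not the paper's implementation: `encode` stands in for the Transformer encoder and a weighted sum of hiddens stands in for decoder fusion; all names are illustrative.

```python
def encode(w, chunk):
    # Toy "encoder": the hidden state is w times the chunk's token sum.
    return w * sum(chunk)

def train_step_cached(w, v, chunks, target):
    # Forward: encode chunks one at a time, keeping only each chunk's
    # final hidden state; intermediate activations are discarded.
    hiddens = [encode(w, c) for c in chunks]
    y = v * sum(hiddens)              # toy decoder "fusion"
    loss = (y - target) ** 2
    # Backward, stage 1: cache gradients at the decoder.
    dy = 2.0 * (y - target)
    dv = dy * sum(hiddens)
    dh = [dy * v for _ in hiddens]    # cached gradient per chunk hidden
    # Backward, stage 2: re-encode each chunk and push its cached
    # gradient through the encoder (recomputation, as in checkpointing).
    dw = 0.0
    for grad_h, chunk in zip(dh, chunks):
        _ = encode(w, chunk)          # recomputed hidden (stands in for
                                      # rebuilding activations per chunk)
        dw += grad_h * sum(chunk)     # d(hidden)/d(w) = sum(chunk)
    return loss, dw, dv
```

Because the chunk-wise backward sums the same per-chunk contributions a full backward pass would, the gradients match those of processing the whole document at once; only the peak activation memory changes, since at most one chunk's encoder activations are live at a time.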
Problem

Research questions and friction points this paper is trying to address.

Transformer Models
Long Document Summarization
Memory Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

CachED
Long Document Summarization
Transformer Models