Modality Agnostic Efficient Long Range Encoder

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the computational and memory bottlenecks arising from the quadratic complexity of self-attention in single-device long-context modeling—and the limitations of existing token merging or attention approximation methods, which suffer from strong modality coupling and suboptimal accuracy-efficiency trade-offs—this paper proposes a modality-agnostic, progressive long-range encoding architecture. Its core contributions are: (1) lightweight dynamic token merging, which adaptively aggregates tokens based on semantic redundancy; and (2) a staged hybrid attention mechanism that employs efficient approximate attention in coarse-grained stages and reverts to standard dot-product attention in fine-grained stages to preserve critical dependencies. Evaluated across text, time-series, audio, and vision classification tasks, the method reduces FLOPs and GPU memory consumption by 42% on average while achieving a 1.3% average accuracy gain over state-of-the-art long-context models.
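The two mechanisms above — merging tokens between stages and switching from an approximate attention to exact dot-product attention once the sequence is short — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the merging here is plain adjacent-pair averaging rather than the paper's redundancy-aware dynamic merging, the feature map `phi`, the `threshold`, and the identity Q/K/V projections are all illustrative assumptions.

```python
import numpy as np

def merge_tokens(x):
    # Halve the sequence by averaging adjacent token pairs
    # (a simple stand-in for the paper's dynamic token merging).
    n = x.shape[0] - x.shape[0] % 2
    return x[:n].reshape(n // 2, 2, -1).mean(axis=1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(q, k, v):
    # Exact dot-product attention: O(N^2) in sequence length N.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def linear_attention(q, k, v):
    # Kernel-style approximation phi(Q)(phi(K)^T V): O(N * d^2),
    # avoiding the N x N score matrix entirely.
    phi = lambda z: np.maximum(z, 0.0) + 1e-6  # positive feature map (assumed)
    kv = phi(k).T @ v
    norm = phi(q) @ phi(k).sum(axis=0, keepdims=True).T
    return (phi(q) @ kv) / norm

def staged_block(x, threshold=64):
    # Identity projections for illustration; a real block would
    # have learned Q/K/V projections, MLPs, and residuals.
    q = k = v = x
    if x.shape[0] > threshold:
        out = linear_attention(q, k, v)    # coarse stage: cheap approximation
    else:
        out = standard_attention(q, k, v)  # fine stage: exact attention
    return merge_tokens(out)               # progressively shorten the sequence

# Run a long sequence through successive stages: 256 -> 128 -> 64 -> 32 tokens.
x = np.random.default_rng(0).normal(size=(256, 16))
while x.shape[0] > 32:
    x = staged_block(x)
```

The point of the staging is that the quadratic attention is only ever applied after merging has already shrunk the sequence, so the exact computation stays cheap while long-range structure is still captured early by the approximation.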

📝 Abstract
The long-context capability of recent large transformer models can be surmised to rely on techniques such as attention/model parallelism, as well as hardware-level optimizations. While these strategies allow input lengths to scale to millions of tokens, they do not fundamentally mitigate the quadratic computational and memory complexity of the core attention mechanism. In this paper, we address the challenge of long-context processing on a single device using generic implementations by reducing the quadratic memory footprint and inference cost. Existing approaches to extend the context length for generic single-device implementations, such as token merging and modified attentions, are often modality-specific and attain a suboptimal tradeoff between accuracy and efficiency. To overcome these limitations, we propose MAELRE (Modality Agnostic Efficient Long Range Encoder), a unified and efficient transformer architecture designed for long-range encoding across diverse modalities. MAELRE integrates token merging with attention approximation, progressively merging tokens at different stages of internal computational blocks. It employs a lightweight attention approximation when the number of tokens is large, and switches to standard dot-product attention as the sequence becomes shorter through successive aggregation. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models on classification tasks spanning multiple modalities, including text, time series, audio, and vision.
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic memory and computational costs in long-context transformers
Overcomes modality-specific limitations in existing long-range encoding methods
Unifies token merging and attention approximation for efficient multi-modal processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality agnostic long-range encoder design
Token merging with attention approximation
Progressive switching to standard attention