DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams

📅 2025-11-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing continual Transformers address the redundant computation caused by sliding-window, real-time streaming inference, but are limited to shallow architectures, which restricts their generalizability and practicality. Method: This work extends continual inference to deep Transformer encoders, proposing the Deep Continual Transformer (DeepCoT), which supports multimodal (audio/video/text) streaming data. Its core innovations are an incremental attention mechanism and a cross-layer state-propagation strategy that explicitly eliminate the duplicate computation induced by window overlap. Contribution/Results: DeepCoT matches the modeling performance of full-sequence Transformers while reducing per-step inference complexity to linear time. Empirical evaluation shows up to 100Γ— speedup over state-of-the-art efficient models, enabling low-latency deployment of deep Transformers on resource-constrained devices.
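To make the redundancy argument concrete, the following is an illustrative sketch (not the authors' implementation; class and function names are hypothetical) of continual single-output attention over a sliding window: keys and values for the window are cached, so each new token costs O(n·d) instead of recomputing full attention over the window at O(n²·d).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

class ContinualSingleOutputAttention:
    """Illustrative sketch of continual (incremental) attention.

    A rolling cache holds the keys/values of the current window; on each
    new token we append the newest key/value, evict the oldest, and compute
    the attention output for the newest query only, avoiding the redundant
    recomputation a naive sliding window would perform.
    """

    def __init__(self, window, d):
        self.window = window
        self.K = np.zeros((0, d))  # cached keys of the current window
        self.V = np.zeros((0, d))  # cached values of the current window

    def step(self, q, k, v):
        # Append the newest key/value; keep only the last `window` entries.
        self.K = np.vstack([self.K, k])[-self.window:]
        self.V = np.vstack([self.V, v])[-self.window:]
        # Scaled dot-product attention for the newest query only: O(n*d).
        scores = self.K @ q / np.sqrt(len(q))
        return softmax(scores) @ self.V
```

The output at each step is identical to recomputing full single-output attention over the current window, which is why continual inference can match the non-continual baseline's accuracy while running in linear time per token.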


πŸ“ Abstract
Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks. At the same time, there is a growing demand for low-latency, high-performance inference on resource-constrained devices. In particular, stream data inference is typically performed over a sliding temporal window, leading to highly redundant computations. The recent Continual Transformers have addressed this issue, but they can only be used effectively in shallow models, which limits their scope and generalization power. In this paper, we propose the Deep Continual Transformer (DeepCoT), a redundancy-free encoder-only model that can be applied over existing deep encoder architectures with minimal changes. In our experiments over audio, video, and text streams, we show that DeepCoTs retain comparable performance to their non-continual baselines while offering a linear computational cost for all Transformer layers, reducing running time by up to two orders of magnitude compared to previous efficient models.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational redundancy in stream data inference
Enabling low-latency inference on resource-constrained devices
Extending continual transformers to deep encoder architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep encoder architecture with minimal changes
Redundancy-free encoder-only model design
Linear computational cost for all layers
Ginés Carreto Picón
Department of Electrical and Computer Engineering, Aarhus University, Denmark
Peng Yuan Zhou
Assistant Professor, ECE, Aarhus University
Extended Reality, Artificial Intelligence
Qi Zhang
Department of Electrical and Computer Engineering, Aarhus University, Denmark
Alexandros Iosifidis
Professor, Dept. of Computing Sciences, Tampere University
Computational Intelligence, Machine Learning, Machine Perception, Financial Data Analytics