🤖 AI Summary
Modeling real-world time series with non-uniform sampling and long-range dependencies remains challenging: RNNs fail to capture continuous dynamics, while Neural ODEs and Transformers suffer from high computational cost and lack of sampling invariance. To address this, we propose the lightweight Continuous-Time Transformer (CT-Transformer), which introduces a novel multi-view signature attention mechanism—integrating path signatures into self-attention to robustly model local dynamics and multi-scale temporal patterns. Our architecture unifies the continuous-time representation capability of Neural ODEs with the global dependency modeling strength of Transformers, while preserving invariance to sequence length and sampling frequency. Experiments demonstrate that CT-Transformer achieves accuracy on par with Neural ODEs and significantly outperforms standard Transformers across multiple time-series tasks. Moreover, it reduces inference latency and memory consumption by over 40%, offering a computationally efficient yet expressive alternative for continuous-time sequence modeling.
📝 Abstract
Time-series data in real-world settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In these settings, traditional sequence-based recurrent models struggle. To overcome this, researchers often replace recurrent architectures with Neural ODE-based models to account for irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of even moderate length. To address this challenge, we introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences and incurs significantly lower computational costs. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global (multi-scale) dependencies in the input data, while remaining robust to changes in the sequence length and sampling frequency and yielding improved spatial processing. We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the representational benefits of Neural ODE-based models, all at a fraction of the computational time and memory resources.