Caracal: Causal Architecture via Spectral Mixing

📅 2026-04-30

🤖 AI Summary

This work addresses the limitations of large language models in long-sequence modeling, which stem from the quadratic complexity of attention mechanisms and bottlenecks in positional encoding. The authors propose Caracal, a novel architecture that introduces the Fast Fourier Transform (FFT) into causal sequence modeling for the first time, replacing conventional attention with Multi-Head Fourier (MHF) modules to achieve O(L log L) sequence mixing. A frequency-domain causal mask, implemented through asymmetric padding and truncation, preserves autoregressive generation. Because it relies solely on standard deep learning operators, Caracal is portable and easy to deploy, and it performs competitively with Transformer and state space model (SSM) baselines on long-sequence tasks.

📝 Abstract

The scalability of Large Language Models to long sequences is hindered by the quadratic cost of attention and the limitations of positional encodings. To address these issues, we introduce Caracal, a novel architecture that replaces attention with a parameter-efficient, $\mathcal{O}(L \log L)$ Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive capabilities via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models relying on hardware-specific implementations (e.g., Mamba), we use standard library operators. This ensures robust portability, eliminating common deployment barriers. Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines, offering a scalable and simple pathway for efficient long-sequence modeling. Code is available in the Appendix.
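The abstract does not spell out the MHF module, so the following is only a minimal sketch of the general idea behind causal FFT mixing, not the authors' implementation. It assumes a single channel, a real-valued mixing kernel no longer than the sequence, and zero-padding on the right only followed by truncation to the first $L$ outputs, which is one plausible reading of "asymmetric padding and truncation". The names `fft`, `causal_fft_mix`, and `causal_direct_mix` are hypothetical, and a small pure-Python radix-2 FFT stands in for a library FFT so the block has no dependencies.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def causal_fft_mix(x, kernel):
    """Causal sequence mixing in O(L log L) via the FFT (illustrative assumption).

    Both inputs are zero-padded on the right to a power of two >= 2L so the
    circular convolution computed by the FFT equals the linear convolution,
    then only the first L outputs are kept. Output t depends on x[0..t] only.
    """
    L = len(x)
    n = 1
    while n < 2 * L:
        n *= 2
    x_pad = list(x) + [0.0] * (n - L)
    k_pad = list(kernel) + [0.0] * (n - len(kernel))
    X = fft(x_pad)
    K = fft(k_pad)
    Y = fft([a * b for a, b in zip(X, K)], inverse=True)
    # The inverse transform above is unnormalized, so divide by n here.
    return [y.real / n for y in Y[:L]]


def causal_direct_mix(x, kernel):
    """O(L^2) reference: y[t] = sum_{j<=t} kernel[j] * x[t - j]."""
    return [
        sum(kernel[j] * x[t - j] for j in range(min(t + 1, len(kernel))))
        for t in range(len(x))
    ]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    kernel = [0.5, -1.0, 0.25, 0.0, 2.0]
    print(causal_fft_mix(x, kernel))
    print(causal_direct_mix(x, kernel))
```

Padding to at least $2L$ is what prevents later positions from wrapping around and leaking into earlier outputs; without it, the FFT product would implement a circular convolution and break autoregressive generation.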

Problem

Research questions and friction points this paper is trying to address.

long-sequence modeling

attention complexity

positional encoding

scalability

autoregressive generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fourier-based architecture

causal masking

efficient sequence modeling

O(L log L) complexity

hardware-agnostic design
