🤖 AI Summary
To address the challenge of modeling long spatiotemporal multimodal sequences for end-to-end autonomous driving on resource-constrained edge devices, this paper proposes the first fully linear-attention-driven generative model. Methodologically, it breaks the quadratic complexity bottleneck of conventional Transformers by introducing a novel lightweight linear cross-attention mechanism that enables efficient cross-modal (camera/LiDAR) and cross-temporal interaction at linear computational complexity, overcoming the limitation of existing linear attention methods, which support only self-attention. The model jointly performs multi-sensor feature alignment and end-to-end trajectory generation. Experimentally, it achieves state-of-the-art planning performance on NAVSIM and Bench2Drive, with inference cost invariant to the length of the historical sequence. Furthermore, it has been successfully deployed on edge platforms, significantly reducing both computational cost and memory footprint compared to prior approaches.
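The paper does not spell out its cross-attention mechanism here, but the general kernelized linear-attention idea it builds on can be sketched. The snippet below is an illustrative toy (not LADY's actual design): queries come from one modality and keys/values from another, and a positive feature map replaces softmax so the cost is O((N+M)·d²) instead of the O(N·M·d) of standard cross-attention. The feature map choice (elu+1) and the camera/LiDAR naming are assumptions for illustration.

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1, a common choice in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_cross_attention(q, k, v):
    """Kernelized cross-attention: q from one modality (N, d),
    k/v from another (M, d)/(M, d_v). Never materializes the
    (N, M) attention matrix."""
    qf, kf = feature_map(q), feature_map(k)       # (N, d), (M, d)
    kv = kf.T @ v                                 # (d, d_v) summary of the k/v stream
    z = kf.sum(axis=0)                            # (d,) normalizer
    return (qf @ kv) / (qf @ z)[:, None]          # (N, d_v)

rng = np.random.default_rng(0)
cam = rng.standard_normal((8, 16))     # hypothetical camera queries
lidar = rng.standard_normal((32, 16))  # hypothetical LiDAR keys/values
out = linear_cross_attention(cam, lidar, lidar)
```

Because the feature map is strictly positive, each output row is a convex combination of the value rows, mirroring the role softmax plays in standard attention.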
📝 Abstract
End-to-end paradigms have demonstrated great potential for autonomous driving, and most existing methods are built upon Transformer architectures. However, Transformers incur a quadratic attention cost, limiting their ability to model long spatial and temporal sequences, particularly on resource-constrained edge platforms. Since autonomous driving inherently demands efficient temporal modeling, this challenge severely limits deployment and real-time performance. Recently, linear attention mechanisms have gained increasing attention due to their favorable time and memory complexity. However, existing linear attention architectures are limited to self-attention and lack support for cross-modal and cross-temporal interactions, both of which are crucial for autonomous driving. In this work, we propose LADY, the first fully linear-attention-based generative model for end-to-end autonomous driving. LADY fuses long-range temporal context at inference with constant computational and memory cost, regardless of the history length of the camera and LiDAR features. In addition, we introduce a lightweight linear cross-attention mechanism that enables effective cross-modal information exchange. Experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that LADY achieves state-of-the-art performance with constant time and memory complexity, offering improved planning performance at significantly reduced computational cost. The model has also been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.
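The constant-cost temporal fusion claim rests on a standard property of linear attention: all past keys/values can be folded into a fixed-size recurrent state, so each new frame is processed in O(d²) time and memory no matter how long the history is. The sketch below illustrates that property in a generic form; the class name, state layout, and feature map are assumptions for illustration, not LADY's implementation.

```python
import numpy as np

def phi(x):
    # Positive feature map elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

class StreamingLinearAttention:
    """Keeps a fixed-size state (S, z) summarizing the entire history
    of keys/values; per-step cost is independent of history length."""
    def __init__(self, d, d_v):
        self.S = np.zeros((d, d_v))   # running sum of outer(phi(k_t), v_t)
        self.z = np.zeros(d)          # running sum of phi(k_t)

    def step(self, q, k, v):
        fk = phi(k)
        self.S += np.outer(fk, v)     # fold the new frame into the state
        self.z += fk
        fq = phi(q)
        return (fq @ self.S) / (fq @ self.z)

rng = np.random.default_rng(1)
d, d_v, T = 8, 4, 20
qs = rng.standard_normal((T, d))
ks = rng.standard_normal((T, d))
vs = rng.standard_normal((T, d_v))
sa = StreamingLinearAttention(d, d_v)
stream_out = np.stack([sa.step(qs[t], ks[t], vs[t]) for t in range(T)])
```

The streaming output is exactly equal to recomputing causal linear attention over the full history at every step, which is what makes inference cost invariant to sequence length.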