🤖 AI Summary
Transformer-based models incur prohibitive O(n²) computational complexity when processing long sequences in autonomous driving, while multimodal and temporal fusion typically rely on hand-crafted, explicit modules. To address these limitations, we propose UniLION—the first unified architecture for autonomous driving built on linear group RNNs. UniLION replaces self-attention with linear-complexity recurrent modeling, enabling native support for LiDAR point clouds, multi-view images, and sequential data without dedicated fusion modules. It jointly models multiple modalities and tasks—including 3D detection, tracking, occupancy prediction, BEV segmentation, motion forecasting, and end-to-end planning—within a single framework. The architecture supports flexible configurations (e.g., LiDAR-only, multimodal, or temporal fusion), eliminating the conventional attention-plus-explicit-fusion paradigm. On key benchmarks, UniLION achieves state-of-the-art or competitive performance while significantly improving efficiency for long-sequence processing and enhancing model generalization.
📝 Abstract
Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., it performs a linear RNN over grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION.
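To make the "linear RNN over grouped features" idea concrete, below is a minimal NumPy sketch of the core computation pattern: the feature sequence is partitioned into fixed-size groups, and a linear recurrence is run independently inside each group, giving O(n) cost in sequence length rather than the O(n²) of self-attention. The function name, the fixed-size grouping, and the simple recurrence parameterization (`h_t = A·h_{t-1} + B·x_t`) are illustrative assumptions for exposition, not the authors' actual operator or implementation.

```python
import numpy as np

def linear_group_rnn(x, group_size, A, B):
    """Toy linear group RNN (illustrative, not the paper's operator).

    x : (n, d) sequence of features (e.g., voxel or image tokens).
    A, B : (d, d) recurrence matrices, shared across groups.
    The recurrence h_t = A @ h_{t-1} + B @ x_t runs independently
    within each group of `group_size` consecutive features, so the
    total cost is linear in n.
    """
    n, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, n, group_size):
        h = np.zeros(d)  # state resets at each group boundary
        for t in range(start, min(start + group_size, n)):
            h = A @ h + B @ x[t]  # linear recurrence, no softmax attention
            out[t] = h
    return out

# Example: 8 feature vectors of dim 4, processed in groups of 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
A = 0.9 * np.eye(4)  # simple decaying state transition (assumed for the demo)
B = np.eye(4)
h = linear_group_rnn(x, group_size=4, A=A, B=B)
print(h.shape)  # (8, 4)
```

Because the state resets at group boundaries, groups can be processed in parallel in a real implementation; the sequential loop here is only for clarity.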