🤖 AI Summary
This work proposes DriveMamba, a novel end-to-end autonomous driving framework that addresses the limitations of existing systems—namely, information loss and error accumulation from modular designs, as well as the computational inefficiency of high-complexity attention mechanisms in modeling dynamic multi-task and multi-sensor relationships. DriveMamba introduces, for the first time, a linear-complexity state space model into end-to-end driving, featuring a unified single-stage Mamba decoder. It leverages task-centric sparse token representations, 3D spatial position ordering, and a bidirectional trajectory-guided “local-to-global” scanning strategy to enable dynamic task relationship modeling, implicit view alignment, and long-range temporal fusion. Experiments on nuScenes and Bench2Drive demonstrate that DriveMamba significantly outperforms current methods in terms of performance, generalization, and computational efficiency.
📝 Abstract
Recent advances towards End-to-End Autonomous Driving (E2E-AD) have often been devoted to integrating modular designs into a unified framework for joint optimization, e.g., UniAD, which follows a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and relies on dense BEV features to encode scene representations. However, such manually ordered designs inevitably cause information loss and cumulative errors, and lack flexible, diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of the image backbone and the quadratic complexity of attention mechanisms also hinder the scalability and efficiency of E2E-AD systems in handling spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence learning, and long-term temporal fusion into a single-stage unified Mamba decoder. Specifically, both extracted image features and expected task outputs are converted into token-level sparse representations in advance, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling to capture task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method is designed to preserve spatial locality from the ego perspective, thus facilitating ego-planning. Extensive experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability, and efficiency of DriveMamba.
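To make the two core ideas of the abstract concrete, the sketch below illustrates (1) ordering sparse tokens by their instantiated 3D positions relative to the ego vehicle ("local-to-global") and (2) a minimal linear-complexity state-space recurrence processed bidirectionally. This is an illustrative toy, not the authors' implementation: the helper names (`sort_tokens_local_to_global`, `linear_ssm_scan`, `bidirectional_scan`), the scalar recurrence parameters, and the use of BEV distance as the sort key are all assumptions; the actual Mamba decoder uses learned, input-dependent selective-scan parameters.

```python
import numpy as np


def sort_tokens_local_to_global(tokens, positions, ego_xy=(0.0, 0.0)):
    """Order sparse tokens by distance from the ego vehicle.

    tokens:    (N, D) token features
    positions: (N, 2) instantiated BEV positions of each token
    Hypothetical helper sketching the "local-to-global" ordering idea.
    """
    dist = np.linalg.norm(positions - np.asarray(ego_xy), axis=1)
    order = np.argsort(dist)          # nearest-to-ego tokens come first
    return tokens[order], order


def linear_ssm_scan(x, a=0.9, b=0.1):
    """Toy linear recurrence h_t = a*h_{t-1} + b*x_t over a token sequence.

    Stands in for a selective scan: cost is O(N) in sequence length,
    in contrast to the O(N^2) cost of full attention.
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = a * h + b * xt
        out[t] = h
    return out


def bidirectional_scan(x):
    """Combine forward and backward scans, mimicking bidirectional scanning."""
    fwd = linear_ssm_scan(x)
    bwd = linear_ssm_scan(x[::-1])[::-1]
    return fwd + bwd


# Toy usage: three tokens at different BEV distances from the ego vehicle.
tokens = np.eye(3)
positions = np.array([[5.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
sorted_tokens, order = sort_tokens_local_to_global(tokens, positions)
print(order)  # nearest token (index 1) is placed first in the sequence

y = bidirectional_scan(sorted_tokens)
print(y.shape)
```

The point of the sketch is the cost profile: each token is visited a constant number of times per scan direction, so doubling the token sequence roughly doubles the work, which is what makes long-context fusion of multi-task, multi-sensor tokens tractable.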