🤖 AI Summary
To address the insufficient perception-decision coordination, error accumulation in modular pipelines, and poor real-time performance of end-to-end models in complex urban autonomous driving scenarios, this paper proposes ME³-BEV: an end-to-end, real-time decision-making framework integrating bird's-eye-view (BEV) perception with deep reinforcement learning. Its core innovations include: (1) Mamba-BEV, a spatiotemporal feature extraction network based on the Mamba architecture that efficiently models long-range spatiotemporal dependencies; (2) a semantic segmentation visualization mechanism that enhances model interpretability; and (3) a unified joint training paradigm for BEV representation learning and policy optimization. Evaluated on the CARLA simulator, ME³-BEV achieves a 28.6% reduction in collision rate and a 21.4% improvement in trajectory tracking accuracy while maintaining real-time inference latency below 50 ms. These results demonstrate its effectiveness in enabling safe, interpretable, efficient, and robust autonomous driving in highly dynamic urban environments.
📝 Abstract
Autonomous driving systems face significant challenges in perceiving complex environments and making real-time decisions. Traditional modular approaches, while offering interpretability, suffer from error propagation and coordination issues, whereas end-to-end learning systems can simplify the design but face computational bottlenecks. This paper presents a novel approach to autonomous driving using deep reinforcement learning (DRL) that integrates bird's-eye view (BEV) perception for enhanced real-time decision-making. We introduce the Mamba-BEV model, an efficient spatio-temporal feature extraction network that combines BEV-based perception with the Mamba framework for temporal feature modeling. This integration allows the system to encode vehicle surroundings and road features in a unified coordinate system and accurately model long-range dependencies. Building on this, we propose the ME³-BEV framework, which utilizes the Mamba-BEV model as a feature input for end-to-end DRL, achieving superior performance in dynamic urban driving scenarios. We further enhance the interpretability of the model by visualizing high-dimensional features through semantic segmentation, providing insight into the learned representations. Extensive experiments on the CARLA simulator demonstrate that ME³-BEV outperforms existing models across multiple metrics, including collision rate and trajectory accuracy, offering a promising solution for real-time autonomous driving.
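To give a rough intuition for the temporal modeling idea behind Mamba-style feature extraction, the sketch below runs a linear state-space recurrence over a sequence of (hypothetical) flattened BEV feature vectors. This is a minimal, non-selective illustration in plain NumPy, not the paper's implementation: real Mamba blocks make the state-space parameters input-dependent ("selective") and use hardware-efficient parallel scans, and all dimensions and parameter values here are invented for illustration.

```python
import numpy as np

# Hypothetical dimensions: T BEV frames, each flattened to a d_model vector.
T, d_model, d_state = 16, 8, 4

rng = np.random.default_rng(0)
x = rng.standard_normal((T, d_model))  # sequence of per-frame BEV features

# Fixed (non-selective) SSM parameters. A decaying state matrix lets
# information from old frames persist, giving long-range temporal context.
A = np.eye(d_state) * 0.9                        # state transition (decay)
B = rng.standard_normal((d_state, d_model)) / d_model  # input projection
C = rng.standard_normal((d_model, d_state)) / d_state  # output projection

def ssm_scan(x_seq):
    """Sequential scan: h_t = A h_{t-1} + B x_t,  y_t = C h_t."""
    h = np.zeros(d_state)
    ys = []
    for x_t in x_seq:
        h = A @ h + B @ x_t   # fold the current frame into the hidden state
        ys.append(C @ h)      # emit a temporally-contextualized feature
    return np.stack(ys)

y = ssm_scan(x)
print(y.shape)  # one output feature vector per input frame: (16, 8)
```

In a DRL pipeline like the one described, each `y_t` would play the role of a state representation fed to the policy network, replacing raw per-frame features with ones that summarize the recent driving context.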