🤖 AI Summary
To address the high computational cost and poor deployability of Transformer-based multimodal models on edge devices for high-level autonomous driving decision-making, this paper proposes an end-to-end multimodal reinforcement learning framework tailored for real-time decision-making. The method introduces a Transformer-like architecture built upon ternary spiking neurons, enabling efficient fusion of heterogeneous inputs—including camera images, LiDAR point clouds, and vehicle pose data. It further incorporates spike-timing-aware mechanisms and a cross-attention module to preserve multimodal representation fidelity while drastically reducing computational complexity. Experimental evaluation on the Highway Environment benchmark demonstrates that the proposed approach achieves comparable or superior decision accuracy across multiple tasks, with a 42% reduction in inference latency and a 58% decrease in power consumption—thereby satisfying stringent real-time and energy-efficiency constraints of in-vehicle edge platforms.
📝 Abstract
This work proposes an end-to-end multi-modal reinforcement learning framework for high-level decision-making in autonomous vehicles. The framework integrates heterogeneous sensory input, including camera images, LiDAR point clouds, and vehicle heading information, through a cross-attention transformer-based perception module. Although transformers have become the backbone of modern multi-modal architectures, their high computational cost limits their deployment in resource-constrained edge environments. To overcome this challenge, we propose a spiking temporal-aware transformer-like architecture that uses ternary spiking neurons for computationally efficient multi-modal fusion. Comprehensive evaluations across multiple tasks in the Highway Environment demonstrate the effectiveness and efficiency of the proposed approach for real-time autonomous decision-making.