🤖 AI Summary
Existing V2X cooperative perception methods for autonomous driving suffer from limited generalization, shallow contextual reasoning, and reliance on single-modal inputs, leaving them ill-equipped to prevent the multi-vehicle collisions most often caused by human error. Meanwhile, vision-language models (VLMs) struggle to balance real-time performance with safety-critical reliability. Method: This paper proposes REACT, a lightweight vision-language-driven V2X cooperative perception and trajectory optimization framework. It introduces a language-guided contextual reasoning mechanism that integrates multimodal sensor inputs, risk-aware trajectory planning, and edge-deployment optimization. Contribution/Results: The framework delivers unified semantic-level scene understanding and end-to-end real-time decision-making. With edge adaptation for deployment on the Jetson AGX Orin platform, it reduces the collision rate by 77%, achieves a 48.2% Video Panoptic Quality (VPQ), and runs with an inference latency of only 0.57 seconds on the DeepAccident benchmark, setting a new state of the art for this task.
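The summary describes a pipeline that fuses multimodal V2X inputs, applies language-guided contextual reasoning, and selects a risk-aware trajectory. The paper does not publish this interface, so the sketch below is purely illustrative: every name (`MultimodalInput`, `contextual_reasoning`, `plan_trajectory`, `Trajectory`) is a hypothetical stand-in for the stages the summary names, not REACT's actual API.

```python
# Illustrative sketch of the pipeline described in the summary; all names are
# hypothetical and do not correspond to the REACT implementation.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MultimodalInput:
    """Fused ego + V2X observations; fields are assumptions, not from the paper."""
    ego_camera: List[float]          # placeholder embedding of the ego camera view
    v2x_messages: List[List[float]]  # placeholder embeddings shared by nearby agents
    scene_prompt: str                # natural-language scene / risk description

@dataclass
class Trajectory:
    waypoints: List[Tuple[float, float]]  # (x, y) positions over the planning horizon
    risk_score: float                     # lower is safer

def contextual_reasoning(inp: MultimodalInput) -> str:
    """Stand-in for the fine-tuned lightweight VLM's language-guided reasoning."""
    # A real system would run a VLM over images and the prompt; here we only echo it.
    return f"Hazard summary for: {inp.scene_prompt}"

def plan_trajectory(reasoning: str, candidates: List[Trajectory]) -> Trajectory:
    """Toy risk-aware selection: pick the candidate with the lowest risk score."""
    return min(candidates, key=lambda t: t.risk_score)

if __name__ == "__main__":
    inp = MultimodalInput(
        ego_camera=[0.1, 0.2],
        v2x_messages=[[0.3, 0.4]],
        scene_prompt="stalled truck two lanes ahead, occluded from the ego view",
    )
    summary = contextual_reasoning(inp)
    best = plan_trajectory(summary, [
        Trajectory(waypoints=[(0.0, 0.0), (5.0, 0.0)], risk_score=0.8),
        Trajectory(waypoints=[(0.0, 0.0), (4.0, 1.0)], risk_score=0.2),
    ])
    print(summary)
    print("Selected waypoints:", best.waypoints, "risk:", best.risk_score)
```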
📝 Abstract
Collisions caused by human error are the most common type of multi-vehicle crash, highlighting the critical need for autonomous driving (AD) systems to leverage cooperative perception through Vehicle-to-Everything (V2X) communication. This capability extends situational awareness beyond the limitations of onboard sensors. However, current transformer-based V2X frameworks suffer from limited generalization, shallow contextual reasoning, and reliance on single-modal inputs. Vision-Language Models (VLMs) offer enhanced reasoning and multimodal integration but typically fall short of real-time performance requirements in safety-critical applications. This paper presents REACT, a real-time, V2X-integrated trajectory optimization framework built upon a fine-tuned lightweight VLM. REACT integrates a set of specialized modules that process multimodal inputs into optimized, risk-aware trajectories. To ensure real-time performance on edge devices, REACT incorporates edge adaptation strategies that reduce model complexity and accelerate inference. Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art performance, including a 77% reduction in collision rate, a 48.2% Video Panoptic Quality (VPQ), and a 0.57-second inference latency on the Jetson AGX Orin. Ablation studies validate the contribution of each input, module, and edge adaptation strategy. These results demonstrate the feasibility of lightweight VLMs for real-time edge-based cooperative planning and showcase the potential of language-guided contextual reasoning to improve safety and responsiveness in autonomous driving.
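The abstract mentions edge adaptation strategies that reduce model complexity and accelerate inference but does not specify which ones. One common technique for shrinking transformer-based models for edge deployment is post-training quantization; the sketch below uses PyTorch's dynamic quantization API only as a generic example, not as REACT's actual adaptation procedure, and the tiny stand-in model is an assumption.

```python
# Hedged illustration: the abstract does not detail REACT's edge adaptation;
# post-training dynamic quantization is shown here as one standard option.
import torch
import torch.nn as nn

# Tiny stand-in for a VLM feed-forward block (the real model is far larger).
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).eval()

# Quantize Linear layers to int8 weights; activations remain in floating point.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
with torch.no_grad():
    print("fp32 output norm:", model(x).norm().item())
    print("int8 output norm:", quantized(x).norm().item())
```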