🤖 AI Summary
Long-range perception (>250 m) is an urgent need for highway autonomous driving, especially for high-inertia heavy-duty trucks, yet existing dense bird's-eye-view (BEV) methods suffer quadratic growth in computation and memory as range increases, as well as poor generalization. This paper proposes a sparse 3D multimodal spatiotemporal feature-encoding framework that fuses camera and LiDAR temporal sequences, coupled with a novel self-supervised pretraining paradigm, and is the first to enable long-range perception modeling at scale from unlabeled multimodal data. By leveraging sparse representations and efficient 3D encoding, the method significantly reduces computational cost while improving detection mAP by 26.6% and reducing Chamfer Distance in LiDAR forecasting by 30.5%, jointly optimizing accuracy and efficiency for far-field perception.
📝 Abstract
Outside of urban hubs, autonomous cars and trucks must master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, about five times the 50-100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also allows extending autonomy from light two-ton passenger vehicles to large forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird's Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we build on a sparse representation and introduce an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pre-training scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and achieves a 26.6% improvement in mAP for object detection and a 30.5% decrease in Chamfer Distance for LiDAR forecasting compared to existing methods. Project Page: https://light.princeton.edu/lrs4fusion/
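The quadratic scaling of dense BEV representations mentioned above can be illustrated with a back-of-the-envelope cell count: a dense BEV grid must cover the full square area around the ego vehicle, so its cell count grows with the square of the perception range, whereas a sparse representation stores only occupied cells. The 0.5 m grid resolution below is an illustrative assumption, not a figure from the paper.

```python
def dense_bev_cells(perception_range_m: float, cell_size_m: float = 0.5) -> int:
    """Cells in a dense BEV grid covering [-R, R] x [-R, R] around the ego vehicle."""
    side = int(2 * perception_range_m / cell_size_m)
    return side * side

# Growing the range from 50 m (urban) to 250 m (highway) multiplies the
# dense grid by (250/50)**2 = 25x, while a sparse encoding only grows
# with the number of occupied cells (roughly, observed surfaces/objects).
for r_m in (50, 100, 250):
    print(f"range {r_m:>3} m -> {dense_bev_cells(r_m):>9,} dense BEV cells")
```

This 25x blow-up in memory and compute at fixed resolution is the motivation for the sparse 3D encoding described in the abstract.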