LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

📅 2026-05-20

🤖 AI Summary
This work addresses the high computational cost of existing multimodal road segmentation methods, which hinders real-time deployment on edge devices. The authors propose LiteViLNet, a lightweight network whose dual-stream encoder fuses RGB and LiDAR modalities. By incorporating a multi-scale feature fusion module (MSFM) and a large-kernel bridge module, the model captures long-range dependencies with linear complexity while using only 14.04 million parameters. On the KITTI Road dataset, LiteViLNet achieves a MaxF score of 96.36%, and in model-only inference it runs at 163.79 FPS on an RTX 4060 Ti GPU and 22.18 FPS on a Jetson Orin NX embedded platform, faster than many heavyweight models while remaining competitive in accuracy.
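To put the reported throughput in terms of a per-frame time budget, the FPS figures can be converted to average latency. The following is a minimal sketch: the helper name `frame_latency_ms` is hypothetical, and only the two FPS values come from the summary.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time per frame in milliseconds for a given throughput."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# FPS values as reported in the summary (model-only inference).
REPORTED_FPS = {
    "RTX 4060 Ti": 163.79,
    "Jetson Orin NX": 22.18,
}

for device, fps in REPORTED_FPS.items():
    latency = frame_latency_ms(fps)
    print(f"{device}: {fps:.2f} FPS is about {latency:.2f} ms per frame")
```

Under this conversion the desktop GPU spends roughly 6.1 ms per frame and the embedded board roughly 45.1 ms. These are averages derived from throughput alone and say nothing about pre-processing, post-processing, or latency variance.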
📝 Abstract
Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder built on depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and in real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, the best among all CNN-based methods and comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on an RTX 4060 Ti (22.18 FPS on a Jetson Orin NX). It outperforms numerous heavyweight methods in inference speed while maintaining highly competitive accuracy, supporting the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.
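The abstract credits depth-wise separable convolutions for the small parameter count. As a rough illustration of why they help, the sketch below compares the weight count of a standard convolution with that of a depth-wise convolution followed by a 1x1 point-wise convolution. The layer sizes (64 input channels, 128 output channels, 3x3 kernel) are assumed for illustration and are not taken from the paper; biases and normalization layers are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depth-wise k x k conv plus a 1x1 point-wise conv (no bias)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes only; not the actual LiteViLNet configuration.
c_in, c_out, k = 64, 128, 3

std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)

print(f"standard conv:        {std} weights")
print(f"depth-wise separable: {sep} weights")
print(f"ratio:                {sep / std:.3f}")
```

For these sizes the separable layer needs 8,768 weights against 73,728 for the standard layer, a ratio of about 0.119, which matches the general expression 1/c_out + 1/k². How much of this saving carries over to the full network depends on design details the abstract does not give.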
Problem

Research questions and friction points this paper is trying to address.

road segmentation
real-time inference
resource-constrained devices
multi-modal fusion
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

lightweight network
vision-LiDAR fusion
multi-scale feature fusion
depth-wise separable convolution
real-time road segmentation