LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of existing multimodal road segmentation methods, which hinders real-time deployment on edge devices. The authors propose LiteViLNet, a lightweight network whose dual-stream encoder extracts features from RGB and LiDAR inputs. A multi-scale feature fusion module (MSFM) handles cross-modal interaction at several levels, and a large-kernel bridge module captures long-range dependencies with linear complexity, all within 14.04 million parameters. On the KITTI Road dataset, LiteViLNet reaches a MaxF score of 96.36% and runs at 163.79 FPS on an RTX 4060 Ti GPU (model-only inference) and 22.18 FPS on a Jetson Orin NX embedded platform, which is faster than many heavyweight models while remaining competitive in accuracy.
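For readers unfamiliar with the MaxF metric quoted above: on KITTI Road it is generally understood as the maximum F1-measure obtained while sweeping the confidence threshold applied to the predicted road probabilities. The sketch below illustrates that idea on a toy list of pixels. The function name `max_f_score`, the 101 evenly spaced thresholds, and the toy data are assumptions made for illustration; the official benchmark evaluates in bird's-eye-view space, which this sketch does not reproduce.

```python
def max_f_score(probs, labels, num_thresholds=100):
    """Return (best F1, threshold) over evenly spaced confidence thresholds.

    probs  : predicted road probabilities, one per pixel (hypothetical input)
    labels : ground-truth labels, 1 = road, 0 = not road
    """
    best_f, best_t = 0.0, 0.0
    for i in range(num_thresholds + 1):
        t = i / num_thresholds
        # Count true positives, false positives, and false negatives at threshold t.
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        if tp == 0:
            continue  # F1 is zero or undefined without any true positive
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t


# Toy example: six pixels with predicted probabilities and ground truth.
probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
best_f, best_t = max_f_score(probs, labels)
print(f"MaxF = {best_f:.4f} at threshold {best_t:.2f}")
```

Because the score is taken at the best threshold, MaxF measures how well the probability map separates road from non-road rather than how well any single fixed cutoff performs.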

📝 Abstract

Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder with depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and in real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, the best among all CNN-based methods and comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on an RTX 4060 Ti (22.18 FPS on a Jetson Orin NX). It outperforms numerous heavyweight methods in inference speed while maintaining highly competitive accuracy, validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.
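To make the parameter savings of depth-wise separable convolutions concrete, the sketch below compares weight counts for a standard convolution and its depth-wise separable counterpart (a per-channel k×k depth-wise convolution followed by a 1×1 point-wise convolution). The layer sizes (64 input channels, 128 output channels, 3×3 kernel) and the function names are illustrative assumptions, not values from the paper; biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depth-wise k x k conv plus a 1 x 1 point-wise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes chosen for illustration only.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
ratio = dws / std

print(f"standard:            {std} weights")
print(f"depthwise separable: {dws} weights")
print(f"ratio:               {ratio:.4f}")  # equals 1/c_out + 1/k**2
```

For these sizes the separable layer needs roughly 12% of the weights of the standard one. The ratio simplifies to 1/c_out + 1/k², so the saving grows with both the kernel size and the number of output channels.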

Problem

Research questions and friction points this paper is trying to address.

road segmentation

real-time inference

resource-constrained devices

multi-modal fusion

computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

lightweight network

vision-LiDAR fusion

multi-scale feature fusion