🤖 AI Summary
To address the limited perception robustness of autonomous navigation robots in complex dynamic environments, this paper proposes a lightweight multimodal temporal fusion framework. It employs an efficient CNN-PointNet hybrid feature extraction network, introduces an attention-driven adaptive cross-modal weighting mechanism to dynamically balance the contributions of RGB and LiDAR features, and leverages Temporal Convolutional Networks (TCNs) to model temporal dependencies for a stronger understanding of motion consistency. Evaluated on the KITTI dataset, the method achieves a 3.5% improvement in navigation accuracy and a 2.2% gain in localization accuracy while maintaining real-time inference speed (>15 FPS). The core contribution is an end-to-end multimodal temporal fusion paradigm that jointly optimizes computational efficiency and perception robustness, significantly improving generalization in cluttered and dynamic scenes.
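The summary does not spell out the fusion layer's exact form, but an attention-driven adaptive weighting over two modalities typically means scoring each modality's features with a small gating function and normalizing the scores with a softmax. The sketch below is a minimal, hypothetical NumPy illustration of that idea (the names `adaptive_cross_modal_fusion` and `w_gate` are our own, not from the paper, and a real system would learn the gate end-to-end with the rest of the network):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_cross_modal_fusion(rgb_feat, lidar_feat, w_gate):
    """Fuse per-sample RGB and LiDAR feature vectors.

    A gating vector scores each modality from its own features, and a
    softmax turns the two scores into per-sample fusion weights, so the
    network can lean on LiDAR when the image is uninformative and vice
    versa.
    """
    # One scalar attention score per modality per sample: shape (B, 2).
    scores = np.stack([rgb_feat @ w_gate, lidar_feat @ w_gate], axis=-1)
    weights = softmax(scores, axis=-1)                      # (B, 2), rows sum to 1
    # Convex combination of the two modality features: shape (B, D).
    fused = weights[:, 0:1] * rgb_feat + weights[:, 1:2] * lidar_feat
    return fused, weights

rng = np.random.default_rng(0)
rgb = rng.normal(size=(4, 64))     # batch of 4 RGB feature vectors
lidar = rng.normal(size=(4, 64))   # matching LiDAR feature vectors
gate = rng.normal(size=64) * 0.1   # hypothetical gating weights
fused, w = adaptive_cross_modal_fusion(rgb, lidar, gate)
```

Because the weights are recomputed per sample, the balance between modalities adapts to each input rather than being a fixed hyperparameter, which is the property the summary attributes to the mechanism.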
📝 Abstract
This paper introduces a novel deep learning-based multimodal fusion architecture aimed at enhancing the perception capabilities of autonomous navigation robots in complex environments. By combining lightweight feature extraction modules, adaptive fusion strategies, and temporal modeling mechanisms, the system effectively integrates RGB images and LiDAR data. The key contributions are: (1) a lightweight feature extraction network that strengthens feature representation; (2) an adaptively weighted cross-modal fusion strategy that improves system robustness; and (3) temporal information modeling that boosts perception accuracy in dynamic scenes. Experimental results on the KITTI dataset show that the proposed approach improves navigation and localization accuracy by 3.5% and 2.2%, respectively, while maintaining real-time performance. This work provides a practical solution for autonomous robot navigation in complex environments.
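The abstract does not detail the temporal model, but the TCNs named in the summary are built from causal dilated 1-D convolutions: the output at time t depends only on inputs at times ≤ t, and dilation widens the receptive field without extra parameters. A minimal sketch of that building block, under the assumption of a single channel and zero padding on the left:

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """One causal dilated 1-D convolution, the basic TCN building block.

    Left-padding by (k - 1) * dilation guarantees the output at step t
    never reads inputs from the future, which is what preserves motion
    consistency when modeling a sensor stream online.
    """
    T, k = len(x), len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # zero history before t = 0
    out = np.zeros(T)
    for t in range(T):
        for i in range(k):
            # Tap i looks back i * dilation steps from time t.
            out[t] += kernel[i] * xp[pad + t - i * dilation]
    return out

# A two-tap kernel with dilation 2 adds the input from two steps back.
x = np.arange(6, dtype=float)                 # [0, 1, 2, 3, 4, 5]
out = causal_dilated_conv1d(x, np.array([1.0, 1.0]), dilation=2)
# → [0., 1., 2., 4., 6., 8.]
```

Stacking such layers with exponentially increasing dilations (1, 2, 4, ...) lets a TCN cover long temporal windows with few layers, which is consistent with the lightweight, real-time design the paper emphasizes.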