🤖 AI Summary
This work addresses the limited terrain perception of humanoid robots in unstructured, human-centric environments. The authors propose a learning-based multimodal fusion framework that integrates LiDAR, depth camera, and IMU data to generate robot-centric elevation maps. A hybrid encoder-decoder network combines a CNN for spatial feature extraction with a GRU for temporal consistency. Point clouds from the LIVOX MID-360 LiDAR are processed via spherical projection and fused with data from an Intel RealSense depth camera and an onboard IMU. Experimental results show that multimodal fusion improves reconstruction accuracy by 7.2% over depth-only and 9.9% over LiDAR-only configurations, and that a 3.2-second temporal context reduces mapping drift.
📝 Abstract
Reliable terrain perception is a critical prerequisite for the deployment of humanoid robots in unstructured, human-centric environments. While traditional systems often rely on manually engineered, single-sensor pipelines, this paper presents a learning-based framework that uses an intermediate, robot-centric heightmap representation. A hybrid Encoder-Decoder Structure (EDS) is introduced, utilizing a Convolutional Neural Network (CNN) for spatial feature extraction fused with a Gated Recurrent Unit (GRU) core for temporal consistency. The architecture integrates multimodal data from an Intel RealSense depth camera, a LIVOX MID-360 LiDAR processed via efficient spherical projection, and an onboard IMU. Quantitative results demonstrate that multimodal fusion improves reconstruction accuracy by 7.2% over depth-only and 9.9% over LiDAR-only configurations. Furthermore, the integration of a 3.2 s temporal context reduces mapping drift.
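To illustrate the spherical projection step mentioned above, the following is a minimal sketch of how a LiDAR point cloud can be turned into a 2D range image that a CNN can consume. It is not the authors' implementation: the function name `spherical_projection`, the image size (32 x 128), the vertical field of view (-7° to 52°), and the nearest-return rule for cells hit by several points are all assumptions made for this example.

```python
import math


def spherical_projection(points, height=32, width=128,
                         elev_min_deg=-7.0, elev_max_deg=52.0):
    """Project 3D points (x, y, z) in the sensor frame into a range image.

    Returns a height x width grid of ranges in metres; empty cells hold 0.0.
    Image size and vertical field of view are illustrative assumptions.
    """
    elev_min = math.radians(elev_min_deg)
    elev_max = math.radians(elev_max_deg)
    image = [[0.0] * width for _ in range(height)]

    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction

        azimuth = math.atan2(y, x)  # horizontal angle in [-pi, pi]
        elevation = math.asin(max(-1.0, min(1.0, z / r)))  # vertical angle
        if not (elev_min <= elevation <= elev_max):
            continue  # outside the assumed vertical field of view

        # Map angles to pixel indices: azimuth wraps around the columns,
        # and the top row corresponds to the highest elevation.
        col = int((azimuth + math.pi) / (2.0 * math.pi) * width) % width
        row = int((elev_max - elevation) / (elev_max - elev_min) * height)
        row = min(row, height - 1)

        # Assumed rule: keep the nearest return when several points share a cell.
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r

    return image


# Example: two points along the forward axis and one straight down.
cloud = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 0.0, -1.0)]
range_image = spherical_projection(cloud)
```

In this example both forward points fall into the same cell, so only the nearer range is kept, and the downward point is discarded because it lies outside the assumed field of view. The resulting dense 2D grid is the kind of input a convolutional encoder can process alongside a depth image.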