Multimodal embodiment-aware navigation transformer

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the degradation of obstacle avoidance in goal-conditioned navigation models when the environment, robot platform, or sensor configuration changes. The authors propose ViLiNT, a multimodal navigation policy that uses a Transformer to fuse RGB images, 3D LiDAR point clouds, a goal embedding, and a robot embodiment descriptor. The Transformer output conditions a diffusion model that generates traversable trajectories, and a path feasibility prediction head, trained offline on automatically annotated data, scores and ranks those trajectories. Both steps take the robot's embodiment into account, a use of embodiment information presented as novel for multimodal and diffusion-based navigation models. Across three simulated environments, ViLiNT improves success rate by 166% on average over the state-of-the-art vision-based baseline NoMaD, and it shows improved robustness and cross-platform generalization in real-world off-road navigation.

📝 Abstract

Goal-conditioned navigation models for ground robots trained using supervised learning show promising zero-shot transfer, but their collision-avoidance capability nevertheless degrades under distribution shift, i.e., changes in environment, robot, or sensor configuration. We propose ViLiNT, a multimodal, attention-based policy for goal navigation, trained on heterogeneous data from multiple platforms and environments, which improves robustness with two key features. First, we fuse RGB images, 3D LiDAR point clouds, a goal embedding, and a robot's embodiment descriptor with a transformer architecture to capture complementary geometry and appearance cues. The transformer's output is used to condition a diffusion model that generates navigable trajectories. Second, using automatically generated offline labels, we train a path clearance prediction head for scoring and ranking trajectories produced by the diffusion model. Both the diffusion conditioning and the trajectory ranking head depend on a robot's embodiment token, which allows our model to generate and select trajectories with respect to the robot's dimensions. Across three simulated environments, ViLiNT improves Success Rate by 166% on average over an equivalent state-of-the-art vision-only baseline (NoMaD). This increase in performance is confirmed through real-world deployments of a rover navigating in obstacle fields. These results highlight that combining multimodal fusion with our collision prediction mechanism leads to improved off-road navigation robustness.
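
The score-and-rank step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the learned path clearance head is replaced here by a simple geometric stand-in (the distance from each waypoint to the nearest obstacle point, minus a robot radius), the candidate trajectories are hard-coded instead of sampled from a diffusion model, and every name (min_clearance, rank_trajectories, feasible_indices, robot_radius) is hypothetical. The aim is only to show how a single embodiment parameter can change which candidates are kept.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def min_clearance(path: Sequence[Point], obstacles: Sequence[Point],
                  robot_radius: float) -> float:
    """Geometric stand-in for a learned clearance head (assumption).

    Returns the smallest waypoint-to-obstacle distance minus the robot
    radius. A positive value means no waypoint overlaps an obstacle.
    """
    if not obstacles:
        return math.inf
    nearest = min(math.dist(p, o) for p in path for o in obstacles)
    return nearest - robot_radius


def rank_trajectories(candidates: Sequence[Sequence[Point]],
                      obstacles: Sequence[Point],
                      robot_radius: float) -> List[Tuple[int, float]]:
    """Return (candidate index, clearance) pairs, best clearance first."""
    scored = [(i, min_clearance(path, obstacles, robot_radius))
              for i, path in enumerate(candidates)]
    # Sort by clearance descending; break ties by candidate index.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


def feasible_indices(candidates: Sequence[Sequence[Point]],
                     obstacles: Sequence[Point],
                     robot_radius: float) -> List[int]:
    """Ranked indices of candidates whose clearance is strictly positive."""
    ranked = rank_trajectories(candidates, obstacles, robot_radius)
    return [i for i, clearance in ranked if clearance > 0.0]


# Hypothetical candidates (metres, robot frame): straight, veer left, veer right.
candidates = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],
    [(0.0, 0.0), (1.0, -0.5), (2.0, -1.0)],
]
obstacles = [(1.0, 0.1)]  # a single obstacle point just left of the straight path

small_robot = feasible_indices(candidates, obstacles, robot_radius=0.2)
large_robot = feasible_indices(candidates, obstacles, robot_radius=0.45)
```

In this toy setup both robots rank the right-hand candidate first, but the smaller robot also keeps the left-hand candidate as feasible, while the larger one rejects it. A learned head conditioned on an embodiment token would replace min_clearance; the ranking and selection logic around it could stay the same.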

Problem

Research questions and friction points this paper is trying to address.

goal-conditioned navigation

distribution shift

collision avoidance

embodiment-aware

multimodal navigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion

embodiment-aware navigation

diffusion-based trajectory generation