🤖 AI Summary
This work addresses the limitations of existing contact-tolerant motion planning methods, which rely on indirect spatial representations and struggle with the uncertainty posed by movable or deformable objects in cluttered environments. The authors propose Direct Contact-Tolerant (DCT) planning, the first framework to integrate vision-language models (VLMs) into contact-tolerant reasoning. Leveraging VLM-driven point cloud segmentation, DCT generates contact-aware representations and preserves semantic consistency from image space to point cloud space through an odometry-assisted cross-frame mask propagation mechanism. By formulating planning as an end-to-end perception-to-control optimization problem, DCT achieves significant performance gains over representative baselines in both Isaac Sim simulations and real-world experiments on a wheeled robot, delivering efficient and robust navigation in complex scenes with movable obstacles.
📝 Abstract
Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., a prebuilt map or obstacle set), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainty. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision-language models (VLMs) into direct point cloud perception and navigation through two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches the inferred masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP-guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints, solved by a specialized deep neural network (DNN). We implement DCT in Isaac Sim and on a real car-like robot, demonstrating that DCT achieves robust and efficient navigation in cluttered environments with movable obstacles, outperforming representative baselines across diverse metrics. The code is available at: https://github.com/ChrisLeeUM/DCT.
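The odometry-assisted mask propagation described in the abstract (cache a VLM-inferred mask, carry it across frames via odometry, project it onto the current scan) can be sketched as a simple geometric pipeline. The snippet below is a minimal illustration under stated assumptions, not the paper's actual implementation: the function name `propagate_mask`, the pose convention `T_prev_cur` (current frame to mask frame), and the use of a pinhole intrinsics matrix `K` are all illustrative choices.

```python
import numpy as np

def propagate_mask(points_cur, mask_prev, T_prev_cur, K):
    """Label current-scan 3D points using a cached segmentation mask.

    points_cur: (N, 3) points in the current camera frame.
    mask_prev:  (H, W) boolean mask cached from an earlier VLM inference.
    T_prev_cur: (4, 4) relative pose from odometry mapping current-frame
                coordinates into the frame where the mask was inferred.
    K:          (3, 3) pinhole camera intrinsics.
    Returns an (N,) boolean array marking contact-tolerant points.
    """
    H, W = mask_prev.shape
    # Transform current points into the frame where the mask was inferred.
    pts_h = np.hstack([points_cur, np.ones((len(points_cur), 1))])
    pts_prev = (T_prev_cur @ pts_h.T).T[:, :3]
    # Project into the cached image plane; keep only points in front of
    # the camera and inside the image bounds.
    z = pts_prev[:, 2]
    uv = (K @ pts_prev.T).T
    u = np.round(uv[:, 0] / np.maximum(z, 1e-6)).astype(int)
    v = np.round(uv[:, 1] / np.maximum(z, 1e-6)).astype(int)
    in_view = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Sample the cached mask at the projected pixel locations.
    labels = np.zeros(len(points_cur), dtype=bool)
    labels[in_view] = mask_prev[v[in_view], u[in_view]]
    return labels
```

This lets the expensive VLM run only occasionally: between inferences, the cached mask is reprojected with cheap odometry-based geometry, and the resulting per-point labels split the scan into contact-tolerant and contact-forbidden sets for the planner.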