🤖 AI Summary
To address key bottlenecks in multi-robot language-conditioned motion planning—namely, poor generalization of diffusion models, high inference overhead, and reliance on explicit environmental modeling and geometric reachability priors—this paper proposes LCHD, an end-to-end vision-driven framework. LCHD eliminates conventional obstacle inputs and explicit environment representations, directly processing RGB images and natural language instructions to generate collision-free trajectories. Its core innovation lies in integrating a heat-equation-inspired diffusion kernel as a physics-informed prior, tightly coupled with CLIP-based semantic encoding, enabling reachability-aware language understanding and robust out-of-distribution generalization. Evaluated across diverse real-world-inspired maps and on physical robot platforms, LCHD achieves significantly higher task success rates, reduces inference latency by an order of magnitude, and operates entirely without runtime obstacle information.
📝 Abstract
Diffusion models have recently emerged as powerful tools for robot motion planning by capturing the multi-modal distribution of feasible trajectories. However, their extension to multi-robot settings with flexible, language-conditioned task specifications remains limited. Furthermore, current diffusion-based approaches incur high computational cost during inference and struggle with generalization because they require explicit construction of environment representations and lack mechanisms for reasoning about geometric reachability. To address these limitations, we present Language-Conditioned Heat-Inspired Diffusion (LCHD), an end-to-end vision-based framework that generates language-conditioned, collision-free trajectories. LCHD integrates CLIP-based semantic priors with a collision-avoiding diffusion kernel serving as a physical inductive bias that enables the planner to interpret language commands strictly within the reachable workspace. This naturally handles out-of-distribution scenarios -- in terms of reachability -- by guiding robots toward accessible alternatives that match the semantic intent, while eliminating the need for explicit obstacle information at inference time. Extensive evaluations on diverse real-world-inspired maps, along with real-robot experiments, show that LCHD consistently outperforms prior diffusion-based planners in success rate, while reducing planning latency.
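The paper's details are not reproduced here, but the core intuition behind a heat-equation-based, collision-avoiding diffusion kernel can be sketched independently: on an occupancy grid, simulating the heat equation with zero flux across obstacle boundaries spreads probability mass only through free space, so the resulting kernel never places mass inside (or beyond) obstacles. The grid, step size, and update rule below are illustrative assumptions, not LCHD's actual implementation.

```python
import numpy as np

def heat_step(p, free, alpha=0.2):
    """One explicit finite-difference heat-equation step on a 2D grid.

    Flux is exchanged only between pairs of free cells, so obstacle
    edges act as reflecting (no-flux) boundaries and total probability
    mass is conserved. alpha <= 0.25 keeps the explicit scheme stable.
    """
    p = p * free  # mass lives only on free cells
    new = p.copy()
    # Exchange flux with each of the 4 grid neighbors.
    for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
        q = np.roll(p, shift, axis=axis)       # neighbor's mass
        f = np.roll(free, shift, axis=axis)    # neighbor's free mask
        new += alpha * (q - p) * free * f      # zero flux across obstacles
    return new

# Toy map: obstacle border plus a vertical wall splitting the grid.
free = np.ones((9, 9))
free[[0, -1], :] = 0.0
free[:, [0, -1]] = 0.0
free[:, 4] = 0.0          # the wall

p = np.zeros((9, 9))
p[4, 2] = 1.0             # unit mass starts left of the wall
for _ in range(50):
    p = heat_step(p, free)
# Mass has diffused over the left region only; none crossed the wall.
```

Under such a kernel, the forward noising process itself respects workspace geometry, which is one way to read the abstract's claim that reachability reasoning is baked in as a physical inductive bias rather than handled by explicit obstacle inputs at inference time.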