🤖 AI Summary
Natural language navigation in dynamic environments suffers from poor generalization due to the combinatorial explosion of instruction variations. Method: This paper proposes a Composable Diffusion Framework that decomposes multi-scale navigation instructions into independent motion primitives and synthesizes them via parallel diffusion models, enabling primitive-level compositionality and zero-shot combinatorial generalization. A two-stage training strategy, supervised pretraining followed by reinforcement learning fine-tuning, eliminates reliance on per-primitive demonstration data. Contribution/Results: Evaluated on both simulation and real-robot platforms, the method achieves significantly higher accuracy and robustness than VLM-based and costmap-composing baselines on unseen instruction combinations, demonstrating flexible, high-precision navigation control under complex, dynamic conditions.
📝 Abstract
This paper considers the problem of enabling robots to navigate dynamic environments while following instructions. The challenge lies in the combinatorial nature of instruction specifications: each instruction can include multiple specifications, and the number of possible specification combinations grows exponentially as the robot's skill set expands. For example, "overtake the pedestrian while staying on the right side of the road" consists of two specifications: "overtake the pedestrian" and "walk on the right side of the road." To tackle this challenge, we propose ComposableNav, based on the intuition that following an instruction involves independently satisfying its constituent specifications, each corresponding to a distinct motion primitive. Using diffusion models, ComposableNav learns each primitive separately, then composes them in parallel at deployment time to satisfy novel combinations of specifications unseen in training. Additionally, to avoid the onerous need for demonstrations of individual motion primitives, we propose a two-stage training procedure: (1) supervised pre-training to learn a base diffusion model for dynamic navigation, and (2) reinforcement learning fine-tuning that molds the base model into different motion primitives. Through simulation and real-world experiments, we show that ComposableNav enables robots to follow instructions by generating trajectories that satisfy diverse and unseen combinations of specifications, significantly outperforming both non-compositional VLM-based policies and costmap composing baselines. Videos and additional materials can be found on the project page: https://amrl.cs.utexas.edu/ComposableNav/
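The deployment-time idea of composing independently trained primitives can be sketched as a DDPM-style sampler that averages the noise predictions of several primitive models at each denoising step. This is a minimal illustrative sketch, not the paper's implementation: the `primitive_models` interface, the linear noise schedule, and all names are assumptions.

```python
import numpy as np

def denoise_step(traj, eps_estimates, alpha, alpha_bar, sigma, rng):
    # Compose primitives by averaging their predicted noise (score) estimates,
    # then apply a standard DDPM reverse-diffusion update.
    eps = np.mean(eps_estimates, axis=0)
    mean = (traj - (1 - alpha) / np.sqrt(1 - alpha_bar) * eps) / np.sqrt(alpha)
    return mean + sigma * rng.standard_normal(traj.shape)

def compose_sample(primitive_models, T=50, traj_shape=(32, 2), seed=0):
    """Sample one trajectory satisfying all primitives jointly.

    Each model in `primitive_models` maps (noisy_traj, timestep) -> predicted
    noise of the same shape; e.g. one primitive for "overtake the pedestrian"
    and one for "stay on the right side of the road".
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)      # illustrative linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    traj = rng.standard_normal(traj_shape)  # start from pure noise
    for t in reversed(range(T)):
        eps_estimates = np.stack([m(traj, t) for m in primitive_models])
        sigma = np.sqrt(betas[t]) if t > 0 else 0.0  # no noise on final step
        traj = denoise_step(traj, eps_estimates,
                            alphas[t], alpha_bars[t], sigma, rng)
    return traj
```

Because the primitives only interact through the summed-and-averaged noise estimates at sampling time, a novel combination of specifications requires no retraining, only selecting which primitive models to include in the list.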