🤖 AI Summary
Robotics tasks are often defined by combinations of multiple factors, such as goals, obstacles, and colors, and collecting expert demonstrations for every combination is prohibitively expensive. This work proposes a factored diffusion policy: a single shared diffusion network trained with per-factor null-token dropout, whose score function decomposes additively over factors at inference, which supports compositional generalization. Unlike compositional-diffusion methods for control that combine separately trained networks, it performs the score decomposition within one network. The method introduces a trajectory-tube certificate that propagates the score error bound through sampling and tracking control to a bound on the closed-loop state trajectory. In state-based multi-gate drone racing, the policy passes 90% of held-out gates, matching an oracle, whereas a K-network composition baseline collapses to 3%. In vision-based single-gate traversal, it transfers zero-shot to an unseen venue, improving the success rate by 11.7 percentage points and reducing the crash rate by 2.4×.
📝 Abstract
Robotic tasks are typically specified by a tuple of factors, such as the object to be grasped, the obstacles to be avoided, and the color of the target. The number of expert demonstrations needed to cover every combination of factor values grows combinatorially. We present factored diffusion policies: a single shared diffusion network trained with per-factor null-token dropout, whose score decomposes additively across factors at inference. Under approximate conditional independence between factors given the action-observation pair, this composition approximates the true joint score with a bounded uniform error, reducing the training-task budget from a product of factor cardinalities to a sum. A trajectory-tube certificate chains this score-level bound through the reverse-time sampling ODE and a contracting tracking controller into a closed-loop state-trajectory tube whose radius factors into an ODE-sensitivity constant and a per-factor score-error budget. Unlike compositional-diffusion methods for control that combine separately trained networks, we use one shared network. Drone racing experiments confirm both the generalization bound and the certificate. On state-based multi-gate racing, the factored policy passes 90% of held-out gates, matching an oracle, while a K-network composition baseline collapses to 3%; on vision-based single-gate traversal, it transfers zero-shot to an unseen venue with an 11.7-percentage-point gain in success rate and a 2.4× reduction in crash rate.
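
The additive composition described in the abstract can be illustrated with a minimal sketch. It assumes the standard compositional form, in which the fully null-conditioned score is corrected by one single-factor term per factor; the abstract does not state the paper's exact formula, so this form is an assumption. The names `composed_score`, `toy_score`, `task_budgets`, and `NULL` are hypothetical, and the analytic Gaussian toy model stands in for the trained network. In the toy model the factors are exactly conditionally independent given `x`, so the composed score matches the joint score; with a learned network the match would only be approximate.

```python
import math
from typing import Callable, Optional, Sequence

import numpy as np

NULL = None  # hypothetical stand-in for a per-factor null token

ScoreFn = Callable[[np.ndarray, Sequence[Optional[float]]], np.ndarray]


def composed_score(score_fn: ScoreFn, x: np.ndarray,
                   factors: Sequence[Optional[float]]) -> np.ndarray:
    """Null-conditioned score plus one single-factor correction per factor."""
    k = len(factors)
    base = score_fn(x, [NULL] * k)  # every factor replaced by its null token
    total = base.copy()
    for i, value in enumerate(factors):
        only_i = [NULL] * k
        only_i[i] = value  # keep factor i, null out all the others
        total += score_fn(x, only_i) - base
    return total


def toy_score(x: np.ndarray, factors: Sequence[Optional[float]],
              sigmas: Sequence[float] = (0.5, 1.0, 2.0)) -> np.ndarray:
    """Analytic score of p(x | observed factors) for the toy model
    x ~ N(0, 1), c_i | x ~ N(x, sigma_i^2), factors independent given x."""
    s = -x  # score of the N(0, 1) prior
    for value, sigma in zip(factors, sigmas):
        if value is not NULL:
            s = s - (x - value) / sigma**2  # likelihood term of one factor
    return s


def task_budgets(cardinalities: Sequence[int]) -> tuple:
    """Illustrative task counts: all combinations vs. one factor at a time."""
    return math.prod(cardinalities), sum(cardinalities)


x = np.array([0.3, -1.2])
factors = [1.0, -0.5, 2.0]
print(composed_score(toy_score, x, factors))  # composed from single-factor calls
print(toy_score(x, factors))                  # joint score, all factors active
print(task_budgets([4, 5, 3]))                # product vs. sum of cardinalities
```

Each call to `score_fn` activates at most one factor, which is the kind of condition a network trained with per-factor null-token dropout has seen; that is why a sum-sized training budget could suffice under the independence assumption.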