🤖 AI Summary
Existing diffusion-based imitation learning struggles to model strong multi-step temporal dependencies and tends to overfit to proprioceptive inputs, neglecting critical visual cues and leaving policies brittle under out-of-distribution conditions. To address this, the authors propose a dual-branch diffusion policy framework: a visual branch encodes task progression and enables state recovery and task retrial, while a fused branch integrates visual and proprioceptive inputs for precise low-level control. Visual dynamics are modeled explicitly via a Deep Koopman operator, and the generative model's test-time loss serves as a confidence signal to guide the fusion of temporally overlapping action chunks. The method combines diffusion modeling, action chunking, temporally overlapping aggregation, and multimodal disentanglement. Evaluated on six simulated tasks, it achieves an average performance gain of 14.6%; on three real-robot tasks, it improves by 15.0%, significantly enhancing cross-distribution generalization and failure recovery.
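The Deep Koopman idea in the summary is that a learned encoder lifts visual observations into a latent space where the dynamics advance *linearly* under a Koopman matrix. A minimal sketch of the one-step prediction objective, assuming a toy linear-tanh encoder in place of the paper's deep visual network (the names `encode`, `koopman_loss`, and all shapes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Hypothetical stand-in encoder: in D3P this would be a deep network
    # over visual features; a fixed tanh lift is used here for illustration.
    return np.tanh(x @ W)

def koopman_loss(z_seq, K):
    """Mean squared one-step prediction error in latent space.

    A Koopman operator K should advance latents linearly:
    z_{t+1} ~= K z_t, so we penalize the residual over the sequence.
    """
    pred = z_seq[:-1] @ K.T          # predicted next latents
    return float(np.mean((z_seq[1:] - pred) ** 2))

# Toy rollout: 10 steps of a 4-dim observation, lifted to 8-dim latents.
obs = rng.normal(size=(10, 4))
W = rng.normal(size=(4, 8)) * 0.5
K = np.eye(8)  # identity init; in practice K is learned jointly with the encoder
z = encode(obs, W)
loss = koopman_loss(z, K)
```

In training, this loss would be minimized jointly over the encoder and `K` by gradient descent, giving the structured temporal dynamics the summary refers to.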
📝 Abstract
Integrating generative models with action chunking has shown significant promise in imitation learning for robotic manipulation. However, the existing diffusion-based paradigm often struggles to capture strong temporal dependencies across multiple steps, particularly when incorporating proprioceptive input. This limitation can lead to task failures, where the policy overfits to proprioceptive cues at the expense of capturing the visually derived features of the task. To overcome this challenge, we propose the Deep Koopman-boosted Dual-branch Diffusion Policy (D3P) algorithm. D3P introduces a dual-branch architecture to decouple the roles of different sensory modality combinations. The visual branch encodes the visual observations to indicate task progression, while the fused branch integrates both visual and proprioceptive inputs for precise manipulation. Within this architecture, when the robot fails to accomplish intermediate goals, such as grasping a drawer handle, the policy can dynamically switch to execute action chunks generated by the visual branch, allowing recovery to previously observed states and facilitating retrial of the task. To further enhance visual representation learning, we incorporate a Deep Koopman Operator module that captures structured temporal dynamics from visual inputs. During inference, we use the test-time loss of the generative model as a confidence signal to guide the aggregation of the temporally overlapping predicted action chunks, thereby enhancing the reliability of policy execution. In simulation experiments across six RLBench tabletop tasks, D3P outperforms the state-of-the-art diffusion policy by an average of 14.6%. On three real-world robotic manipulation tasks, it achieves a 15.0% improvement. Code: https://github.com/dianyeHuang/D3P.
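Because action chunking produces temporally overlapping predictions for the same timestep, the abstract's confidence-guided aggregation can be sketched as a softmax weighting over the test-time generative losses, with lower loss treated as higher confidence. The function name `fuse_chunks`, the softmax-over-negative-loss form, and the `temperature` parameter are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def fuse_chunks(chunks, losses, temperature=1.0):
    """Fuse overlapping action predictions for the current timestep.

    chunks: sequence of action vectors, each predicted for this timestep
            by a different (temporally overlapping) action chunk.
    losses: test-time generative-model loss for each chunk; a lower loss
            is read as higher confidence via a softmax over -loss.
    """
    losses = np.asarray(losses, dtype=float)
    w = np.exp(-losses / temperature)
    w /= w.sum()                      # normalized confidence weights
    return sum(wi * c for wi, c in zip(w, np.asarray(chunks, dtype=float)))

# Three overlapping predictions of a 2-D action; the first chunk has the
# lowest loss, so the fused action leans toward it.
fused = fuse_chunks([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.1, 2.0, 2.0])
```

The design choice here is a convex combination, so the fused action always stays inside the range spanned by the candidate chunks, which is a natural safety property for low-level control.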