Improving Robustness to Out-of-Distribution States in Imitation Learning via Deep Koopman-Boosted Diffusion Policy

📅 2025-11-01
🤖 AI Summary
Existing diffusion-based imitation learning struggles to model strong multi-step temporal dependencies and tends to overfit to proprioceptive inputs, neglecting critical visual cues; this leads to poor policy robustness under out-of-distribution conditions. To address this, we propose a dual-branch diffusion policy framework: a vision branch dedicated to state recovery and task retries, and a fused branch that combines visual and proprioceptive inputs for low-level control. We explicitly model visual dynamics via a Deep Koopman operator and use the generative loss as a confidence signal to guide action-chunk fusion. The method integrates diffusion modeling, action chunking, aggregation of temporally overlapping chunks, and multimodal disentanglement. Evaluated on six simulated tasks, it achieves an average performance gain of 14.6%; on three real-robot tasks, it improves by 15.0%. The approach significantly enhances cross-distribution generalization and failure-recovery capability.

📝 Abstract
Integrating generative models with action chunking has shown significant promise in imitation learning for robotic manipulation. However, the existing diffusion-based paradigm often struggles to capture strong temporal dependencies across multiple steps, particularly when incorporating proprioceptive input. This limitation can lead to task failures, where the policy overfits to proprioceptive cues at the expense of capturing the visually derived features of the task. To overcome this challenge, we propose the Deep Koopman-boosted Dual-branch Diffusion Policy (D3P) algorithm. D3P introduces a dual-branch architecture to decouple the roles of different sensory modality combinations. The visual branch encodes the visual observations to indicate task progression, while the fused branch integrates both visual and proprioceptive inputs for precise manipulation. Within this architecture, when the robot fails to accomplish intermediate goals, such as grasping a drawer handle, the policy can dynamically switch to execute action chunks generated by the visual branch, allowing recovery to previously observed states and facilitating retrial of the task. To further enhance visual representation learning, we incorporate a Deep Koopman Operator module that captures structured temporal dynamics from visual inputs. During inference, we use the test-time loss of the generative model as a confidence signal to guide the aggregation of the temporally overlapping predicted action chunks, thereby enhancing the reliability of policy execution. In simulation experiments across six RLBench tabletop tasks, D3P outperforms the state-of-the-art diffusion policy by an average of 14.6%. On three real-world robotic manipulation tasks, it achieves a 15.0% improvement. Code: https://github.com/dianyeHuang/D3P.
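The abstract describes using the test-time loss of the generative model as a confidence signal when aggregating temporally overlapping action chunks. The following is a minimal sketch of one way such confidence-weighted fusion could look; the `exp(-loss)` weighting, function names, and data layout are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def aggregate_chunks(chunks, losses, chunk_len, horizon):
    """Fuse temporally overlapping action chunks into one trajectory.

    chunks: list of (start_step, actions) pairs, where actions has shape
            (chunk_len, action_dim) and start_step is the chunk's offset.
    losses: per-chunk test-time generative loss, treated as an
            inverse-confidence signal (lower loss -> higher confidence).
    """
    action_dim = chunks[0][1].shape[1]
    num = np.zeros((horizon, action_dim))
    den = np.zeros((horizon, 1))
    weights = np.exp(-np.asarray(losses))  # assumed confidence mapping
    for (start, actions), w in zip(chunks, weights):
        end = min(start + chunk_len, horizon)
        num[start:end] += w * actions[: end - start]
        den[start:end] += w
    den[den == 0] = 1.0  # steps covered by no chunk stay zero
    return num / den
```

With equal losses this reduces to a plain average over the overlap; a chunk with a much larger denoising loss contributes proportionally less to the steps it covers.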
Problem

Research questions and friction points this paper is trying to address.

Addresses imitation learning's struggle with temporal dependencies across steps
Overcomes policy overfitting to proprioceptive cues at visual feature expense
Enhances robustness to out-of-distribution states in robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch architecture decouples different sensory modality roles
Deep Koopman Operator captures structured temporal visual dynamics
Test-time loss guides aggregation of overlapping predicted action chunks
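The Koopman idea behind the second bullet is that nonlinear visual dynamics become (approximately) linear in a learned latent space: encode the observation, advance it with a fixed linear operator K, and decode. The sketch below illustrates only that structure; the dimensions, random linear maps standing in for the learned encoder/decoder, and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: visual feature dim -> Koopman latent dim
feat_dim, koop_dim = 8, 4

# Stand-ins for learned networks (here: fixed random linear maps)
W_enc = rng.standard_normal((koop_dim, feat_dim)) * 0.1
W_dec = rng.standard_normal((feat_dim, koop_dim)) * 0.1
K = 0.95 * np.eye(koop_dim)  # Koopman operator: linear latent dynamics

def encode(x):
    return np.tanh(W_enc @ x)

def decode(z):
    return W_dec @ z

def rollout(x0, steps):
    """Predict future visual features by iterating the linear operator K."""
    z = encode(x0)
    preds = []
    for _ in range(steps):
        z = K @ z           # one linear step in Koopman space
        preds.append(decode(z))
    return preds
```

In training, encoder, decoder, and K would be optimized jointly so that multi-step latent rollouts match the observed visual sequence; the linearity of K is what gives the representation its structured temporal dynamics.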