Sample from What You See: Visuomotor Policy Learning via Diffusion Bridge with Observation-Embedded Stochastic Differential Equation

📅 2025-12-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing imitation learning methods treat observations solely as conditional inputs to diffusion denoising networks, initializing sampling from random Gaussian noise, thereby decoupling perception from control and limiting performance. This paper proposes BridgePolicy, the first approach to embed visual and state observations directly into the stochastic differential equation (SDE) dynamics of the diffusion process, enabling generative policy modeling grounded in informative priors rather than noise. To address the heterogeneity between multi-modal observations and the action space, the authors design a multi-modal fusion module and a semantic aligner. Evaluated on three simulation benchmarks (52 tasks) and five real-world robotic tasks, BridgePolicy significantly outperforms state-of-the-art methods, demonstrating superior control accuracy, robustness to distribution shifts, and cross-scenario generalization.

Technology Category

Application Category

๐Ÿ“ Abstract
Imitation learning with diffusion models has advanced robotic control by capturing multi-modal action distributions. However, existing approaches typically treat observations as high-level conditioning inputs to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, sampling must begin from random Gaussian noise, weakening the coupling between perception and control and often yielding suboptimal performance. We introduce BridgePolicy, a generative visuomotor policy that explicitly embeds observations within the stochastic differential equation via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich, informative prior rather than random noise, substantially improving precision and reliability in control. A key challenge is that classical diffusion bridges connect distributions with matched dimensionality, whereas robotic observations are heterogeneous and multi-modal and do not naturally align with the action space. To address this, we design a multi-modal fusion module and a semantic aligner that unify visual and state inputs and align observation and action representations, making the bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and five real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.
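The abstract's core idea — start reverse-time sampling from an observation-informed prior rather than pure Gaussian noise — can be illustrated with a toy sketch. The paper does not specify its exact SDE, so the drift, step count, and `denoiser` below are hypothetical stand-ins for the learned network, not BridgePolicy's actual formulation; a minimal Euler–Maruyama loop is assumed.

```python
import numpy as np

def sample_action(obs_embedding, denoiser, n_steps=50, sigma=0.1, rng=None):
    """Toy diffusion-bridge sampler (illustrative only).

    Unlike a standard diffusion policy, which draws its initial sample
    from N(0, I), sampling here starts at the observation embedding and
    is iteratively refined toward the action manifold.
    """
    rng = np.random.default_rng(rng)
    a = np.asarray(obs_embedding, dtype=float).copy()  # informative prior, not noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt                       # integrate time from 1 down to 0
        drift = denoiser(a, t)                 # learned score/velocity estimate
        noise = sigma * np.sqrt(dt) * rng.standard_normal(a.shape)
        a = a + drift * dt + noise             # Euler-Maruyama update
    return a

# Hypothetical denoiser that pulls samples toward a fixed target action,
# standing in for the trained network.
target = np.array([0.5, -0.2])
toy_denoiser = lambda a, t: (target - a) / max(t, 1e-3)

action = sample_action(np.zeros(2), toy_denoiser, rng=0)
```

The key contrast with standard diffusion policies is the first line of the loop's setup: `a` is initialized from `obs_embedding` (the semantically aligned observation representation) instead of `rng.standard_normal(...)`, so the bridge connects the observation distribution to the action distribution directly.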
Problem

Research questions and friction points this paper is trying to address.

Observations serve only as high-level conditioning inputs, decoupled from the diffusion dynamics
Sampling must begin from uninformative random Gaussian noise
Heterogeneous, multi-modal observations do not naturally align with the action space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embedding observations into diffusion bridge SDE
Starting sampling from observation-informed prior
Aligning heterogeneous observations with action space
🔎 Similar Papers
No similar papers found.