🤖 AI Summary
Vision-based motion policies often generalize poorly because they overfit to fixed training conditions such as camera pose and background, which hinders robust 3D representation learning via multi-view fusion. To address this, we propose OmniD, a bird's-eye-view (BEV) representation learning framework grounded in multi-view image fusion. Its core is the Omni-Feature Generator, which employs deformable attention for task-driven feature selection and efficient cross-view feature aggregation. Integrated with diffusion-based policy learning and BEV projection, OmniD produces compact, semantically rich spatial representations that suppress viewpoint-specific noise and background interference. Extensive experiments demonstrate OmniD's strong generalization: it achieves average improvements of 11%, 17%, and 84% over state-of-the-art methods on in-distribution, out-of-distribution, and few-shot transfer benchmarks, respectively.
📝 Abstract
Visuomotor policies can easily overfit to their training datasets, e.g., to fixed camera positions and backgrounds. Such overfitting lets a policy perform well in in-distribution scenarios but underperform under out-of-distribution conditions. Moreover, existing methods struggle to fuse multi-view information into an effective 3D representation. To tackle these issues, we propose the Omni-Vision Diffusion Policy (OmniD), a multi-view fusion framework that synthesizes image observations into a unified bird's-eye-view (BEV) representation. We introduce a deformable-attention-based Omni-Feature Generator (OFG) to selectively abstract task-relevant features while suppressing view-specific noise and background distractions. OmniD achieves 11%, 17%, and 84% average improvements over the best baseline for in-distribution, out-of-distribution, and few-shot experiments, respectively. Training code and the simulation benchmark are available at: https://github.com/1mather/omnid.git
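To make the deformable-attention fusion idea concrete, here is a minimal NumPy sketch of the general mechanism: each BEV cell has a reference pixel in every camera view, learned offsets shift the sampling locations, and attention weights combine the bilinearly sampled features across views and points. This is an illustrative sketch of standard deformable attention, not the paper's OFG implementation; all function names, tensor shapes, and the single-scale, single-head setup are assumptions.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample feature map feat (H, W, C) at continuous pixel (x, y)."""
    H, W, _ = feat.shape
    x0 = int(np.clip(np.floor(x), 0, W - 1))
    y0 = int(np.clip(np.floor(y), 0, H - 1))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx = np.clip(x - x0, 0.0, 1.0)
    wy = np.clip(y - y0, 0.0, 1.0)
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

def deformable_bev_fusion(view_feats, ref_points, offsets, logits):
    """Aggregate multi-view image features into BEV cell features.

    view_feats: list of V feature maps, each (H, W, C)
    ref_points: (B, V, 2) reference pixel (x, y) per BEV cell per view
    offsets:    (B, V, K, 2) learned sampling offsets around each reference
    logits:     (B, V, K) unnormalized attention scores
    returns:    (B, C) fused BEV cell features
    """
    B, V, K, _ = offsets.shape
    C = view_feats[0].shape[-1]
    out = np.zeros((B, C))
    for b in range(B):
        # softmax jointly over views and sampling points, so weights sum to 1
        w = np.exp(logits[b] - logits[b].max())
        w /= w.sum()
        for v in range(V):
            for k in range(K):
                x = ref_points[b, v, 0] + offsets[b, v, k, 0]
                y = ref_points[b, v, 1] + offsets[b, v, k, 1]
                out[b] += w[v, k] * bilinear_sample(view_feats[v], x, y)
    return out
```

Because the attention weights are normalized jointly across views and sampling points, the network can down-weight an occluded or noisy view entirely, which is the mechanism the summary credits with suppressing view-specific noise.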