🤖 AI Summary
Vision-based motion policies often generalize poorly because they overfit to fixed training conditions such as camera pose and background, which hinders robust 3D representation learning via multi-view fusion. To address this, we propose OmniD, a bird's-eye-view (BEV) representation learning framework grounded in multi-view image fusion. Its core is the Omni-Feature Generator, which employs deformable attention for task-driven feature selection and efficient cross-view feature aggregation. Integrated with diffusion-based policy learning and BEV projection, OmniD produces compact, semantically rich spatial representations that suppress viewpoint-specific noise and background interference. Extensive experiments demonstrate OmniD's strong generalization: it achieves average improvements of 11%, 17%, and 84% over state-of-the-art methods on in-distribution, out-of-distribution, and few-shot transfer benchmarks, respectively.
📝 Abstract
Visuomotor policies can easily overfit to their training datasets, e.g., to fixed camera positions and backgrounds. Such overfitting lets a policy perform well in in-distribution scenarios but underperform under out-of-distribution conditions. Moreover, existing methods struggle to fuse multi-view information into an effective 3D representation. To tackle these issues, we propose the Omni-Vision Diffusion Policy (OmniD), a multi-view fusion framework that synthesizes image observations into a unified bird's-eye-view (BEV) representation. We introduce a deformable-attention-based Omni-Feature Generator (OFG) to selectively abstract task-relevant features while suppressing view-specific noise and background distractions. OmniD achieves 11%, 17%, and 84% average improvements over the best baseline for in-distribution, out-of-distribution, and few-shot experiments, respectively. Training code and the simulation benchmark are available at: https://github.com/1mather/omnid.git
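To make the deformable-attention fusion idea concrete, here is a minimal NumPy sketch of the general mechanism: each BEV cell has a reference pixel in every camera view, learned offsets shift the sampling locations, and attention weights combine the bilinearly sampled features across views and points. This is an illustrative sketch of standard deformable attention, not the paper's OFG implementation; all function names, tensor shapes, and the single-scale, single-head setup are assumptions.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample feature map feat (H, W, C) at continuous pixel (x, y)."""
    H, W, _ = feat.shape
    x0 = int(np.clip(np.floor(x), 0, W - 1))
    y0 = int(np.clip(np.floor(y), 0, H - 1))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx = np.clip(x - x0, 0.0, 1.0)
    wy = np.clip(y - y0, 0.0, 1.0)
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

def deformable_bev_fusion(view_feats, ref_points, offsets, logits):
    """Aggregate multi-view image features into BEV cell features.

    view_feats: list of V feature maps, each (H, W, C)
    ref_points: (B, V, 2) reference pixel (x, y) per BEV cell per view
    offsets:    (B, V, K, 2) learned sampling offsets around each reference
    logits:     (B, V, K) unnormalized attention scores
    returns:    (B, C) fused BEV cell features
    """
    B, V, K, _ = offsets.shape
    C = view_feats[0].shape[-1]
    out = np.zeros((B, C))
    for b in range(B):
        # softmax jointly over views and sampling points, so weights sum to 1
        w = np.exp(logits[b] - logits[b].max())
        w /= w.sum()
        for v in range(V):
            for k in range(K):
                x = ref_points[b, v, 0] + offsets[b, v, k, 0]
                y = ref_points[b, v, 1] + offsets[b, v, k, 1]
                out[b] += w[v, k] * bilinear_sample(view_feats[v], x, y)
    return out
```

Because the attention weights are normalized jointly across views and sampling points, the network can down-weight an occluded or noisy view entirely, which is the mechanism the summary credits with suppressing view-specific noise.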