Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning

📅 2025-10-02

🤖 AI Summary

This work addresses the weak cross-view generalization of visual imitation learning policies by explicitly conditioning them on camera extrinsics. Methodologically, it (1) encodes per-pixel camera rays with Plücker embeddings, giving the policy a geometric description of the camera pose; (2) introduces six manipulation tasks in RoboSuite and ManiSkill, each with paired "fixed" and "randomized" scene variants that decouple background cues from camera pose and expose policies that rely on static-background shortcuts; and (3) adds extrinsics as a conditioning input to standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. Experiments show that conditioning on extrinsics significantly improves generalization across viewpoints and restores performance when workspace geometry or camera placement shifts, yielding robust RGB-only control without depth. The contributions are a geometrically principled conditioning mechanism, a diagnostic benchmark for background bias, and an extrinsics-aware extension that applies to several existing imitation learning architectures.

📝 Abstract

We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in RoboSuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes; this shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code at https://ripl.github.io/know_your_camera/ .
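
As a rough illustration of what a Plücker embedding of per-pixel rays can look like, the sketch below builds a 6-channel map that stores, for every pixel, the unit ray direction d and the moment o × d, where o is the camera center. It is not taken from the released code. The function name `plucker_ray_map`, the pinhole intrinsics `K`, the camera-to-world pose convention, and the half-pixel offset are all assumptions made for this example.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d), one per pixel.

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4 camera-to-world
    pose; both conventions are illustrative choices, not the paper's.
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    R, o = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d = dirs_cam @ R.T
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Moment vector: camera center crossed with the ray direction.
    m = np.cross(o, d)
    return np.concatenate([d, m], axis=-1)

# Hypothetical camera: 64x48 image, focal length 100 px, shifted off the origin.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
rays = plucker_ray_map(K, pose, height=48, width=64)
print(rays.shape)  # (48, 64, 6)
```

Because the moment o × d is the same for any point chosen along a ray, the encoding describes the ray itself rather than a particular sample on it, and it changes whenever the camera moves or rotates.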

Problem

Research questions and friction points this paper is trying to address.

Learning view-invariant robot manipulation policies using camera extrinsics

Improving policy generalization across viewpoints without depth sensors

Addressing performance collapse when camera pose or background changes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditions policies explicitly on camera extrinsics

Encodes camera pose as Plücker embeddings of per-pixel rays

Enables robust RGB-only control without depth
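
One simple way to feed a per-pixel ray encoding to an image-based policy is to stack it with the RGB channels before the visual encoder. The sketch below shows only that stacking step. Whether the paper fuses the embedding this way for each architecture is not stated in this summary, so treat the approach, the helper name `concat_ray_channels`, and the 8-bit pixel scaling as assumptions.

```python
import numpy as np

def concat_ray_channels(rgb, ray_map):
    """Stack an (H, W, 3) uint8 image and an (H, W, 6) ray map into (H, W, 9)."""
    if rgb.shape[:2] != ray_map.shape[:2]:
        raise ValueError("image and ray map must share height and width")
    # Scale 8-bit pixels to [0, 1] so both inputs are floating point.
    rgb = rgb.astype(np.float32) / 255.0
    return np.concatenate([rgb, ray_map.astype(np.float32)], axis=-1)

rgb = np.zeros((48, 64, 3), dtype=np.uint8)
ray_map = np.zeros((48, 64, 6))  # placeholder values standing in for a ray encoding
obs = concat_ray_channels(rgb, ray_map)
print(obs.shape)  # (48, 64, 9)
```

The resulting 9-channel observation carries the camera geometry alongside appearance, so a policy no longer has to guess the viewpoint from background cues.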

🔎 Similar Papers

Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation

2024-09-23 · arXiv.org · Citations: 0