🤖 AI Summary
This work addresses the weak cross-view generalization of visual imitation learning policies by explicitly conditioning them on camera extrinsics. Methodologically, it (1) encodes per-pixel rays with Plücker embeddings, giving the policy an explicit geometric description of the camera pose; (2) introduces six RGB-only manipulation tasks in RoboSuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose and exposing policies' reliance on static-background shortcuts; and (3) supplies the extrinsics as a conditioning input to standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. Experiments show that conditioning on extrinsics significantly improves generalization across viewpoints and restores performance when workspace geometry or camera placement shifts, without requiring depth information. The contributions are a geometrically principled conditioning mechanism, a diagnostic set of tasks for background bias, and an extrinsics-aware extension compatible with widely used imitation learning architectures.
📝 Abstract
We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in RoboSuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes; this shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code at https://ripl.github.io/know_your_camera/.
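To make the idea of a Plücker embedding of per-pixel rays concrete, here is a minimal sketch in plain Python. It is not the released implementation. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, world-to-camera extrinsics `R, t` (so that `x_cam = R @ x_world + t`), and the common Plücker convention `(d, o × d)` with a unit direction `d` and camera center `o`. The function names `plucker_ray` and `plucker_map` are hypothetical, and the abstract does not say how the resulting six-channel map is fed to each policy.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate_transpose(R, v):
    """Return R^T v for a 3x3 rotation R given as nested rows."""
    return tuple(sum(R[k][i] * v[k] for k in range(3)) for i in range(3))


def plucker_ray(u, v, fx, fy, cx, cy, R, t):
    """Plücker coordinates (d, o x d) of the world-space ray through pixel (u, v).

    Assumes world-to-camera extrinsics: x_cam = R x_world + t.
    """
    # Camera center in world coordinates: o = -R^T t.
    o = tuple(-x for x in rotate_transpose(R, t))
    # Back-project the pixel to a direction in the camera frame (pinhole model).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the world frame and normalize it.
    d = rotate_transpose(R, d_cam)
    norm = math.sqrt(sum(x * x for x in d))
    d = tuple(x / norm for x in d)
    # The moment o x d encodes where the ray sits in space.
    m = cross(o, d)
    return d + m


def plucker_map(height, width, fx, fy, cx, cy, R, t):
    """Nested list of shape (height, width, 6): one Plücker ray per pixel center."""
    return [[plucker_ray(x + 0.5, y + 0.5, fx, fy, cx, cy, R, t)
             for x in range(width)]
            for y in range(height)]
```

For example, `plucker_map(2, 3, 100.0, 100.0, 1.5, 1.0, R, t)` returns a 2×3 grid of 6-tuples. Because both the direction and the moment change with `R` and `t`, the map varies with camera pose even when the RGB background stays the same, which is the property that makes it usable as an explicit extrinsics signal. A practical version would vectorize this with NumPy or PyTorch and concatenate or embed the map alongside the image features.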