🤖 AI Summary
In robotic manipulation, the absence of hand-mounted cameras leaves visual perception gaps that degrade control accuracy. Method: This paper introduces an "inference-time imagination" mechanism, using a LoRA-finetuned pretrained diffusion model (ZeroNVS) for hardware-free, cross-view visual completion: given a global-view image and camera pose, it synthesizes high-fidelity virtual hand-eye views at inference time. No additional sensors or system modifications are required. Contribution/Results: The approach significantly enhances the robustness of visuomotor policies under occluded or restricted-view conditions. Evaluated in simulation and in a real-world strawberry harvesting task, it achieves over 92% performance recovery, demonstrating effectiveness, generalizability, and deployment feasibility. Its core contribution is the integration of lightweight, controllable diffusion-based generation into real-time robotic perception enhancement, establishing a pathway toward low-cost, highly adaptive embodied intelligence.
📝 Abstract
Visual observations from different viewpoints can significantly influence the performance of visuomotor policies in robotic manipulation. Among these, egocentric (in-hand) views often provide crucial information for precise control. However, in some applications, equipping robots with dedicated in-hand cameras may pose challenges due to hardware constraints, system complexity, and cost. In this work, we propose to endow robots with imaginative perception, enabling them to "imagine" in-hand observations from agent views at inference time. We achieve this via novel view synthesis (NVS), leveraging a fine-tuned diffusion model conditioned on the relative pose between the agent-view and in-hand cameras. Specifically, we apply LoRA-based fine-tuning to adapt a pretrained NVS model (ZeroNVS) to the robotic manipulation domain. We evaluate our approach on both simulation benchmarks (RoboMimic and MimicGen) and real-world experiments using a Unitree Z1 robotic arm for a strawberry picking task. Results show that synthesized in-hand views significantly enhance policy inference, effectively recovering the performance drop caused by the absence of real in-hand cameras. Our method offers a scalable and hardware-light solution for deploying robust visuomotor policies, highlighting the potential of imaginative visual reasoning in embodied agents.
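The core idea, a frozen pretrained layer augmented with a trainable low-rank (LoRA) update while the relative camera pose enters as extra conditioning, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's code: the layer shapes, the 7-dimensional pose encoding (translation plus quaternion), and the conditioning-by-concatenation scheme are all hypothetical stand-ins for how ZeroNVS would actually be adapted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: feature width, pose encoding (xyz + quaternion),
# output width, LoRA rank, and LoRA scaling factor.
D_FEAT, D_POSE, D_OUT, RANK, ALPHA = 64, 7, 64, 4, 8.0

# Frozen "pretrained" weight, standing in for one projection inside the
# diffusion model. Only the LoRA factors A and B below would be trained.
W_base = rng.normal(scale=0.02, size=(D_OUT, D_FEAT + D_POSE))

# LoRA factors. B is zero-initialized, so before fine-tuning the adapted
# layer reproduces the pretrained layer exactly.
A = rng.normal(scale=0.01, size=(RANK, D_FEAT + D_POSE))
B = np.zeros((D_OUT, RANK))

def lora_layer(features, rel_pose):
    """Frozen layer plus low-rank update, with pose as conditioning input."""
    x = np.concatenate([features, rel_pose], axis=-1)
    return x @ W_base.T + (ALPHA / RANK) * (x @ A.T @ B.T)

feat = rng.normal(size=(1, D_FEAT))   # agent-view image features
pose = rng.normal(size=(1, D_POSE))   # relative agent-to-in-hand camera pose
out = lora_layer(feat, pose)

# Zero-initialized B makes the adapter a no-op at the start of training.
base_out = np.concatenate([feat, pose], axis=-1) @ W_base.T
assert np.allclose(out, base_out)
```

The zero-initialized `B` is the standard LoRA trick: fine-tuning starts from the pretrained model's behavior and only gradually departs from it, which is what makes adapting a large NVS model to a narrow robotics domain with little data feasible.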