Imagination at Inference: Synthesizing In-Hand Views for Robust Visuomotor Policy Inference

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In robotic manipulation, the absence of a hand-mounted camera creates a visual perception gap that degrades control accuracy. Method: The paper introduces an "inference-time imagination" mechanism, using a LoRA-finetuned pretrained diffusion model (ZeroNVS) for hardware-free, cross-view visual completion: given a global-view image and the relative camera pose, it synthesizes high-fidelity virtual hand-eye views during inference. No additional sensors or system modifications are required. Contribution/Results: The approach substantially improves the robustness of visuomotor policies under occluded or restricted-view conditions. Evaluated in simulation and in a real-world strawberry-harvesting task, it recovers over 92% of the performance lost without a physical in-hand camera, demonstrating effectiveness, generalizability, and deployment feasibility. The core contribution is the first integration of lightweight, controllable diffusion-based generation into real-time robotic perception enhancement, establishing a pathway toward low-cost, highly adaptive embodied intelligence.

📝 Abstract
Visual observations from different viewpoints can significantly influence the performance of visuomotor policies in robotic manipulation. Among these, egocentric (in-hand) views often provide crucial information for precise control. However, in some applications, equipping robots with dedicated in-hand cameras may pose challenges due to hardware constraints, system complexity, and cost. In this work, we propose to endow robots with imaginative perception, enabling them to 'imagine' in-hand observations from agent views at inference time. We achieve this via novel view synthesis (NVS), leveraging a fine-tuned diffusion model conditioned on the relative pose between the agent-view and in-hand cameras. Specifically, we apply LoRA-based fine-tuning to adapt a pretrained NVS model (ZeroNVS) to the robotic manipulation domain. We evaluate our approach on both simulation benchmarks (RoboMimic and MimicGen) and real-world experiments using a Unitree Z1 robotic arm for a strawberry-picking task. Results show that synthesized in-hand views significantly enhance policy inference, effectively recovering the performance drop caused by the absence of real in-hand cameras. Our method offers a scalable and hardware-light solution for deploying robust visuomotor policies, highlighting the potential of imaginative visual reasoning in embodied agents.
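The LoRA-based fine-tuning the abstract mentions adapts a frozen pretrained model by training only a low-rank additive update to its weights. A minimal numpy sketch of the mechanism for a single linear layer (the rank, scaling factor, and layer sizes here are illustrative assumptions, not the paper's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 32, 4, 8  # illustrative sizes, not from the paper

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; only A and B are trained,
    # so the adapter adds 2 * r * (d_in + d_out) / 2 parameters per layer
    # instead of d_out * d_in.
    return (W + (alpha / r) * B @ A) @ x

x = rng.standard_normal(d_in)
# Because B starts at zero, the adapted layer initially matches the frozen base.
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initializing `B` is the standard LoRA trick: fine-tuning starts exactly at the pretrained model's behavior and drifts only as the adapter trains.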
Problem

Research questions and friction points this paper is trying to address.

Synthesizing in-hand camera views from agent perspectives
Overcoming hardware constraints for robotic visuomotor policies
Enhancing policy performance without physical in-hand cameras
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizing in-hand views via diffusion model
Fine-tuning pretrained NVS with LoRA adaptation
Conditioning on relative camera pose parameters
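The relative-pose conditioning in the bullet above can be derived from the two cameras' extrinsics. A minimal sketch, assuming both poses are available as 4x4 camera-to-world transforms (the frame convention and the example numbers are assumptions for illustration, not taken from the paper):

```python
import numpy as np

def relative_pose(T_world_agent, T_world_hand):
    """Pose of the in-hand camera expressed in the agent camera's frame."""
    return np.linalg.inv(T_world_agent) @ T_world_hand

def make_pose(R, t):
    # Assemble a 4x4 homogeneous transform from rotation R and translation t.
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Example: agent camera at the world origin; hand camera translated and
# rotated 90 degrees about the z axis.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T_agent = make_pose(np.eye(3), np.zeros(3))
T_hand = make_pose(Rz, np.array([0.1, 0.0, 0.2]))

T_rel = relative_pose(T_agent, T_hand)
# With the agent camera at the identity, the relative pose is the hand pose itself.
assert np.allclose(T_rel, T_hand)
```

By construction, composing the agent pose with the relative pose recovers the hand pose (`T_world_agent @ T_rel == T_world_hand`), which is what makes this a well-defined conditioning signal for the NVS model.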