LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing approaches to open-world robotic manipulation: weak generalization and difficulty modeling fine-grained spatial and geometric relationships. The authors propose LAMP, a method that lifts the 2D spatial cues implicit in image editing into continuous, geometry-aware 3D transformation representations. By treating image editing as a source of general-purpose 3D priors, the method combines language instructions with geometric reasoning in a zero-shot manipulation framework that requires no task-specific training. Experiments show that the framework predicts precise 3D transformations and generalizes zero-shot in open-world settings, addressing the limited 3D spatial understanding of current models.

📝 Abstract

Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.
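
To make the idea of "lifting" 2D cues into a 3D transformation more concrete, here is a minimal sketch. It is not the authors' pipeline, which the abstract does not describe in detail. It assumes hypothetical inputs: matched pixel locations of an object before and after an edit, a depth value for each pixel, and pinhole camera intrinsics (`fx`, `fy`, `cx`, `cy`). The pixels are back-projected to 3D points, and a rigid transform is fitted with the standard Kabsch algorithm, used here only as a stand-in for whatever estimation LAMP actually performs. The function names `backproject` and `fit_rigid_transform` are illustrative.

```python
import numpy as np


def backproject(pixels, depths, fx, fy, cx, cy):
    """Lift (u, v) pixels with depth z to 3D camera-frame points (pinhole model)."""
    pixels = np.asarray(pixels, dtype=float)
    z = np.asarray(depths, dtype=float)
    x = (pixels[:, 0] - cx) * z / fx
    y = (pixels[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def fit_rigid_transform(src, dst):
    """Kabsch algorithm: least-squares R, t such that dst is close to src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points, N >= 3 and not collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Given lifted point sets for the object before and after the edit, `fit_rigid_transform(before_pts, after_pts)` returns a rotation matrix and translation vector, which is one example of the continuous, geometry-aware representation the abstract refers to. In practice, noisy correspondences or depth would likely call for a robust variant such as RANSAC around this fit.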

Problem

Research questions and friction points this paper is trying to address.

open-world manipulation

3D priors

generalization

robotic manipulation

fine-grained spatial relations

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D priors

image editing

open-world manipulation