🤖 AI Summary
Existing methods for image and video object manipulation struggle to simultaneously preserve background content, maintain geometric consistency across viewpoints, and offer fine-grained user control. This work proposes Ctrl&Shift, an end-to-end diffusion framework that decomposes manipulation into two stages, object removal followed by camera-pose-guided reference-based inpainting, enabling geometrically consistent editing within a unified diffusion process without explicit 3D modeling. By integrating explicit camera pose control, reference-guided inpainting, and a multi-task, multi-stage training strategy, the method effectively disentangles background, identity, and pose signals. This design preserves generalization to real-world scenes while supporting precise geometric manipulation. Experiments show that Ctrl&Shift significantly outperforms existing geometry-based and diffusion-based approaches in generation fidelity, viewpoint consistency, and user controllability.
📝 Abstract
Object-level manipulation, i.e., relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework that achieves geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and to encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control with real-world generalization for object manipulation without relying on any explicit 3D modeling.
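The two-stage decomposition described above can be summarized as a pipeline: first erase the object to recover the background, then reinsert it conditioned on its reference appearance and a relative camera pose. The sketch below is a minimal, hypothetical illustration of that control flow only; the function names (`remove_object`, `pose_guided_inpaint`, `manipulate`) and the mask-based placeholder logic are our assumptions, not the paper's implementation. A real system would replace each placeholder body with a diffusion pass conditioned on the listed inputs.

```python
import numpy as np

def remove_object(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stage 1 (placeholder): erase the masked object, leaving the background.
    In the described framework this would be a diffusion-based removal pass."""
    background = image.copy()
    background[mask] = 0.0  # hole where the object used to be
    return background

def pose_guided_inpaint(background: np.ndarray, hole_mask: np.ndarray,
                        reference: np.ndarray, rel_pose: np.ndarray) -> np.ndarray:
    """Stage 2 (placeholder): re-insert the object under a relative camera pose.
    Here we simply paste the reference pixels back; a real system would run
    reference-guided diffusion inpainting conditioned on rel_pose.
    (rel_pose is unused in this stub.)"""
    out = background.copy()
    out[hole_mask] = reference[hole_mask]
    return out

def manipulate(image: np.ndarray, mask: np.ndarray,
               rel_pose: np.ndarray) -> np.ndarray:
    """End-to-end: object removal followed by pose-conditioned reinsertion,
    using the original image as the object's reference appearance."""
    reference = image
    background = remove_object(image, mask)
    return pose_guided_inpaint(background, mask, reference, rel_pose)
```

With the identity pose, this stub simply reconstructs the input; the point is only to show how the two stages compose into one edit, with background, identity (reference), and pose entering as separate conditioning signals.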