Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the key challenge of achieving high-fidelity, multi-view-consistent 3D editing from only 1–2 input images, without scene-specific fine-tuning. It proposes an end-to-end referring multi-view editor and an any-view-to-video synthesizer, enabling, for the first time, zero-shot, multi-view-consistent 3D content editing and novel-view synthesis. The method builds on pretrained diffusion models, integrating spatiotemporal video priors, explicit multi-view geometric consistency constraints, and reference-driven editing mechanisms, fully eliminating per-scene optimization. Quantitative and qualitative evaluations demonstrate state-of-the-art performance in editing fidelity, novel-view synthesis quality, and rendering enhancement. The approach significantly lowers the barrier to general-purpose 3D content creation, enabling high-quality cross-style and cross-scene editing without retraining or per-instance adaptation.

📝 Abstract
We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker
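
To make the two-stage design concrete, below is a minimal structural sketch of the pipeline the abstract describes. It is illustrative only: every class, method, and parameter name is a hypothetical placeholder rather than the authors' published API, and all diffusion-model internals are elided.

```python
# Conceptual sketch of Tinker's two-stage, feed-forward pipeline.
# All names are hypothetical placeholders, not the paper's actual interfaces.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class View:
    image: np.ndarray  # H x W x 3 source photograph
    pose: np.ndarray   # 4 x 4 camera-to-world extrinsics


class ReferringMultiViewEditor:
    """Stage 1 (hypothetical wrapper): propagates a reference-driven edit
    across the 1-2 input views so the edit stays coherent from every
    viewpoint, reusing a pretrained image diffusion model."""

    def edit(self, views: List[View], reference: np.ndarray) -> List[View]:
        raise NotImplementedError("placeholder for the multi-view editor")


class AnyViewToVideoSynthesizer:
    """Stage 2 (hypothetical wrapper): treats the edited sparse views as
    keyframes and uses the spatiotemporal prior of a pretrained video
    diffusion model for scene completion along a novel camera trajectory."""

    def synthesize(self, keyframes: List[View],
                   trajectory: List[np.ndarray]) -> List[np.ndarray]:
        raise NotImplementedError("placeholder for the synthesizer")


def tinker_pipeline(views: List[View], reference: np.ndarray,
                    trajectory: List[np.ndarray]) -> List[np.ndarray]:
    """End-to-end inference: edit the sparse views, then render novel views."""
    edited = ReferringMultiViewEditor().edit(views, reference)
    return AnyViewToVideoSynthesizer().synthesize(edited, trajectory)
```

The design point the sketch mirrors is that both stages are pretrained and feed-forward, so a new scene needs no per-scene optimization or fine-tuning at inference time.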
Problem

Research questions and friction points this paper is trying to address.

Achieving multi-view consistent 3D editing from sparse inputs
Eliminating per-scene optimization requirements for 3D editing
Enabling zero-shot 3D content creation without scene-specific training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposing pretrained diffusion models to unlock their latent 3D awareness
Referring multi-view editor for reference-driven edits coherent across viewpoints
Any-view-to-video synthesizer for scene completion and novel views from sparse inputs

👥 Authors
Canyu Zhao
Zhejiang University, China
Generative Model, Deep Learning
Xiaoman Li
Zhejiang University, China
Tianjian Feng
Zhejiang University, China
Zhiyue Zhao
Zhejiang University, China
Hao Chen
Zhejiang University, China
Chunhua Shen
Zhejiang University, China
Computer Vision, Machine Learning