Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image-to-video methods struggle to achieve both motion controllability and content editability in disoccluded regions. To address this, we propose the Proxy Dynamic Graph (PDG), a framework that explicitly models the decoupled relationship between visibility and motion and enables training-free, inference-time controllable generation. PDG employs a lightweight graph structure to drive part-wise motion, combining a frozen diffusion prior with motion-flow-guided, visibility-aware latent synthesis to unify loose pose editing and precise appearance specification. Our method outperforms state-of-the-art approaches on articulated scenes (furniture, vehicles, and deformable objects) while enabling accurate appearance editing in disoccluded regions. Generated videos exhibit physically plausible motion structures, high visual consistency across frames, and strong user controllability without fine-tuning.
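
To make the decoupling between visibility and motion concrete, here is a minimal, hypothetical sketch of what a proxy dynamic graph could look like in code: each part node carries a region mask, a user-editable transform for the final frame, and a visibility flag, while edges record joint connectivity. The class and field names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a Proxy Dynamic Graph; names and fields are assumptions.
from __future__ import annotations
from dataclasses import dataclass, field
import numpy as np

@dataclass
class PartNode:
    name: str
    mask: np.ndarray        # HxW boolean mask of the part in the input image
    transform: np.ndarray   # 3x3 homogeneous 2D transform to the final frame
    visible: bool = True    # whether the part is visible in the input frame

@dataclass
class ProxyDynamicGraph:
    nodes: dict[str, PartNode] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)  # joint connections

    def add_part(self, node: PartNode) -> None:
        self.nodes[node.name] = node

    def repose(self, name: str, transform: np.ndarray) -> None:
        """Loosely repose a part by assigning its final-frame transform."""
        self.nodes[name].transform = transform
```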

📝 Abstract
We address image-to-video generation with explicit user control over the final frame's disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior synthesizes plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow that lets us leverage diffusion as a motion-guided shader. The user then edits appearance in the disoccluded areas of the image, and we exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages over state-of-the-art alternatives in turning images into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable control, in the form of appearance specification in the disoccluded regions of the final frame, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyondvisible.github.io/
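
The abstract's pipeline (repose the PDG, derive a dense motion flow, then use the frozen diffusion prior as a motion-guided shader) can be illustrated with a small sketch. Assuming each part's repose is a 2D homogeneous transform, a dense flow field can be assembled by warping every pixel inside a part's mask; the function name and the simple overwrite rule for overlapping parts are assumptions for illustration only, not the paper's code.

```python
# Minimal sketch: densify a loose per-part repose into a per-pixel motion flow.
import numpy as np

def dense_flow_from_parts(masks, transforms, height, width):
    """masks: list of HxW bool arrays; transforms: list of 3x3 homogeneous 2D maps.
    Returns an HxWx2 flow field sending each part pixel to its reposed location."""
    ys, xs = np.mgrid[0:height, 0:width]
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # HxWx3
    flow = np.zeros((height, width, 2))
    for mask, T in zip(masks, transforms):
        warped = coords @ T.T                      # apply the part transform to every pixel
        flow_part = warped[..., :2] - coords[..., :2]
        flow[mask] = flow_part[mask]               # later parts overwrite earlier ones
    return flow

# Example: translate one rectangular part 10 pixels to the right.
h, w = 64, 64
mask = np.zeros((h, w), dtype=bool); mask[20:40, 10:30] = True
shift = np.array([[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
flow = dense_flow_from_parts([mask], [shift], h, w)
```

In the actual method such a flow would serve as guidance for the frozen diffusion prior; the sketch only shows how a loose graph repose can be turned into per-pixel motion guidance.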
Problem

Research questions and friction points this paper is trying to address.

Generating controllable motion in image-to-video synthesis.
Enforcing user-specified content in newly revealed disoccluded areas.
Separating motion specification from appearance synthesis for predictability.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses an editable Proxy Dynamic Graph for part-wise motion control.
Leverages a frozen diffusion prior as a motion-guided appearance shader.
Performs a visibility-aware latent-space composite for user-specified disoccluded areas (see the sketch after this list).
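
As a rough illustration of the last point, a visibility-aware latent composite can be as simple as a mask-weighted blend: inside regions the PDG marks as disoccluded in the final frame, latents encoding the user-edited appearance replace the motion-driven latents, and everything else is left untouched. The function below is a hedged sketch under that assumption, not the paper's implementation.

```python
# Illustrative sketch of a visibility-aware latent composite; names are assumptions.
import numpy as np

def composite_latents(motion_latent, edited_latent, disocclusion_mask):
    """motion_latent, edited_latent: CxHxW latents at the final frame;
    disocclusion_mask: HxW weights in [0, 1], 1 where the user edit should win."""
    m = disocclusion_mask[None, ...]               # broadcast the mask over channels
    return (1.0 - m) * motion_latent + m * edited_latent
```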
🔎 Similar Papers
No similar papers found.