Generative Perception of Shape and Material from Differential Motion

📅 2025-06-03
📈 Citations: 0
Influential Citations: 0
📄 PDF
🤖 AI Summary
Single-image estimation of shape, material, and illumination is severely coupled, leading to perceptual ambiguity. To address this, we propose a joint shape-material disentanglement method that leverages short object-motion videos, in which differential motion provides geometric and reflectance cues that alleviate single-view ambiguity. We introduce, for the first time, conditional denoising diffusion models for motion-driven generative shape-and-material perception, enabling pixel-wise end-to-end training. Our parameter-efficient architecture supports multimodal uncertainty modeling and achieves rapid distribution convergence under dynamic inputs, trained solely on synthetic video supervision. Experiments demonstrate high-fidelity joint estimation on synthetic data and strong generalization to real-world objects, substantially improving ambiguity resolution across static and dynamic observations.

📝 Abstract
Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions quickly converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, our work suggests a generative perception approach for improving visual reasoning in physically-embodied systems.
Problem

Research questions and friction points this paper is trying to address.

Resolving shape-material ambiguity from differential motion videos
Generating diverse shape-material maps under static observations
Improving visual reasoning via continuous motion-based perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional denoising-diffusion model for shape-material generation
Pixel-space training with parameter-efficient architecture
Generative perception from differential motion videos
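
To make the core idea concrete, below is a minimal, generic PyTorch sketch of how a video-conditioned, pixel-space denoising diffusion model could produce shape-and-material maps by ancestral sampling. The class names (MotionEncoder, PixelUNet), channel layout, and noise schedule are illustrative assumptions and do not reflect the authors' actual architecture or training setup.

```python
# Hypothetical sketch: video-conditioned diffusion sampling of shape-and-material maps.
# All module names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Toy encoder that pools per-frame features from a short object-motion clip."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.conv(video.reshape(b * t, c, h, w))
        return feats.reshape(b, t, -1, h, w).mean(dim=1)   # average over frames -> (B, C, H, W)

class PixelUNet(nn.Module):
    """Stand-in for a parameter-efficient pixel-space denoiser; predicts noise."""
    def __init__(self, out_ch=7, feat_ch=64):      # e.g. normals (3) + albedo (3) + roughness (1)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(out_ch + feat_ch, 128, 3, padding=1), nn.SiLU(),
            nn.Conv2d(128, out_ch, 3, padding=1),
        )

    def forward(self, x_t, cond, t):
        # Timestep embedding omitted for brevity; a real denoiser would condition on t.
        return self.net(torch.cat([x_t, cond], dim=1))

@torch.no_grad()
def sample_shape_material(video, encoder, denoiser, steps=50):
    """DDPM-style ancestral sampling of a shape-and-material map given a motion clip."""
    cond = encoder(video)
    b, _, h, w = cond.shape
    x = torch.randn(b, 7, h, w)                    # start from pure noise in pixel space
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for i in reversed(range(steps)):
        eps = denoiser(x, cond, i)
        mean = (x - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[i]) * noise
    return x                                       # (B, 7, H, W) shape-and-material channels

# Example: one hypothetical forward pass on a 4-frame clip.
video = torch.randn(1, 4, 3, 64, 64)
maps = sample_shape_material(video, MotionEncoder(), PixelUNet())
print(maps.shape)                                  # torch.Size([1, 7, 64, 64])
```

Because sampling starts from noise, repeated draws for a static clip would naturally yield diverse, multimodal explanations, while stronger motion cues in the conditioning signal would concentrate the samples, mirroring the emergent behavior described in the abstract.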
Xinran Nicole Han
Harvard University
Ko Nishino
Professor, Kyoto University
Computer Vision, Artificial Intelligence, Machine Learning
T. Zickler
Harvard University