Generative Perception of Shape and Material from Differential Motion

📅 2025-06-03
📈 Citations: 0
Influential Citations: 0
📄 PDF
🤖 AI Summary
Single-image estimation of shape, material, and illumination is severely coupled, leading to perceptual ambiguity. To address this, we propose a joint shape-material disentanglement method that leverages short object-motion videos, in which differential motion provides geometric and reflectance cues that alleviate single-view ambiguity. We introduce, for the first time, conditional denoising diffusion models for motion-driven generative shape-and-material perception, enabling pixel-wise end-to-end training. Our parameter-efficient architecture supports multimodal uncertainty modeling and achieves rapid distribution convergence under dynamic inputs, trained solely on synthetic video supervision. Experiments demonstrate high-fidelity joint estimation on synthetic data and strong generalization to real-world objects, substantially improving ambiguity resolution across static and dynamic observations.

📝 Abstract
Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions quickly converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, our work suggests a generative perception approach for improving visual reasoning in physically-embodied systems.
Problem

Research questions and friction points this paper is trying to address.

Resolving shape-material ambiguity from differential motion videos
Generating diverse shape-material maps under static observations
Improving visual reasoning via continuous motion-based perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional denoising-diffusion model for shape-material generation
Pixel-space training with parameter-efficient architecture
Generative perception from differential motion videos
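
To make the core idea concrete, below is a minimal, generic PyTorch sketch of how a video-conditioned, pixel-space denoising diffusion model could produce shape-and-material maps by ancestral sampling. The class names (MotionEncoder, PixelUNet), channel layout, and noise schedule are illustrative assumptions and do not reflect the authors' actual architecture or training setup.

```python
# Hypothetical sketch: video-conditioned diffusion sampling of shape-and-material maps.
# All module names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Toy encoder that pools per-frame features from a short object-motion clip."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.conv(video.reshape(b * t, c, h, w))
        return feats.reshape(b, t, -1, h, w).mean(dim=1)   # average over frames -> (B, C, H, W)

class PixelUNet(nn.Module):
    """Stand-in for a parameter-efficient pixel-space denoiser; predicts noise."""
    def __init__(self, out_ch=7, feat_ch=64):      # e.g. normals (3) + albedo (3) + roughness (1)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(out_ch + feat_ch, 128, 3, padding=1), nn.SiLU(),
            nn.Conv2d(128, out_ch, 3, padding=1),
        )

    def forward(self, x_t, cond, t):
        # Timestep embedding omitted for brevity; a real denoiser would condition on t.
        return self.net(torch.cat([x_t, cond], dim=1))

@torch.no_grad()
def sample_shape_material(video, encoder, denoiser, steps=50):
    """DDPM-style ancestral sampling of a shape-and-material map given a motion clip."""
    cond = encoder(video)
    b, _, h, w = cond.shape
    x = torch.randn(b, 7, h, w)                    # start from pure noise in pixel space
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for i in reversed(range(steps)):
        eps = denoiser(x, cond, i)
        mean = (x - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[i]) * noise
    return x                                       # (B, 7, H, W) shape-and-material channels

# Example: one hypothetical forward pass on a 4-frame clip.
video = torch.randn(1, 4, 3, 64, 64)
maps = sample_shape_material(video, MotionEncoder(), PixelUNet())
print(maps.shape)                                  # torch.Size([1, 7, 64, 64])
```

Because sampling starts from noise, repeated draws for a static clip would naturally yield diverse, multimodal explanations, while stronger motion cues in the conditioning signal would concentrate the samples, mirroring the emergent behavior described in the abstract.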
Xinran Nicole Han
Harvard University
Ko Nishino
Professor, Kyoto University
Computer Vision, Artificial Intelligence, Machine Learning
T. Zickler
Harvard University