🤖 AI Summary
This work addresses the problem of generating physically plausible, functionally appropriate, and context-consistent poses for rigged 3D objects in open-domain real-world scenes. It proposes a topology-agnostic, affordance-aware pose synthesis framework, described as the first of its kind. The method combines a 2D inpainting diffusion model with differentiable rendering to align the synthesized image evidence with the 3D rig, and it uses semantic keypoint matching and control-guided optimization to enforce functional constraints and compatibility with the environment. The framework takes arbitrary internet-sourced rigged models paired with real-scene meshes, converges stably, and produces plausible poses within minutes. Key contributions include: (1) a topology-agnostic, open-domain, functionality-driven pose synthesis paradigm; (2) a pipeline that couples diffusion-based inpainting with differentiable rendering for cross-modal alignment; and (3) reduced reliance on expert artistic knowledge and manual hyperparameter tuning.
📝 Abstract
Rigged objects are commonly used in artist pipelines, as they can flexibly adapt to different scenes and postures. However, articulating the rigs into realistic affordance-aware postures (e.g., following the context, respecting the physics and the personalities of the object) remains time-consuming and relies heavily on manual work by experienced artists. In this paper, we tackle this novel problem and design A3Syn. Given a context, such as an environment mesh and a text prompt describing the desired posture, A3Syn synthesizes articulation parameters for arbitrary, open-domain rigged objects obtained from the Internet. The task is challenging because training data is lacking and because we make no topological assumptions about the open-domain rigs. We propose using a 2D inpainting diffusion model and several control techniques to synthesize in-context affordance information. We then develop an efficient bone correspondence alignment procedure that combines differentiable rendering with semantic correspondence. A3Syn converges stably, completes in minutes, and synthesizes plausible affordances for different combinations of in-the-wild object rigs and scenes.
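The alignment step described in the abstract can be illustrated with a toy example. The sketch below is not the A3Syn implementation. It assumes a planar two-bone chain, treats the bone-tip positions as the "rendered" keypoints, and uses finite-difference gradients as a stand-in for a differentiable renderer. The target keypoints play the role of semantic correspondences extracted from an inpainted image. All function names and parameters are hypothetical.

```python
import numpy as np

def forward_kinematics(angles, bone_lengths):
    """Return the 2D tip position of every bone in a planar chain rooted at the origin."""
    tips = []
    current = np.zeros(2)
    total_angle = 0.0
    for angle, length in zip(angles, bone_lengths):
        total_angle += angle  # joint angles are relative to the parent bone
        current = current + length * np.array([np.cos(total_angle), np.sin(total_angle)])
        tips.append(current)
    return np.stack(tips)

def keypoint_loss(angles, bone_lengths, targets):
    """Sum of squared distances between bone tips and their target keypoints."""
    diff = forward_kinematics(angles, bone_lengths) - targets
    return float(np.sum(diff ** 2))

def numerical_grad(angles, bone_lengths, targets, eps=1e-6):
    """Central finite differences (assumption: stands in for autodiff through a renderer)."""
    grad = np.zeros_like(angles)
    for i in range(len(angles)):
        bump = np.zeros_like(angles)
        bump[i] = eps
        plus = keypoint_loss(angles + bump, bone_lengths, targets)
        minus = keypoint_loss(angles - bump, bone_lengths, targets)
        grad[i] = (plus - minus) / (2.0 * eps)
    return grad

def fit_pose(targets, bone_lengths, steps=2000, lr=0.05):
    """Plain gradient descent on the joint angles, starting from the rest pose."""
    angles = np.zeros(len(bone_lengths))
    for _ in range(steps):
        angles = angles - lr * numerical_grad(angles, bone_lengths, targets)
    return angles

# Hypothetical setup: targets are generated from a known pose so the fit can be checked.
bone_lengths = np.array([1.0, 0.8])
true_angles = np.array([0.6, -0.4])
targets = forward_kinematics(true_angles, bone_lengths)

fitted = fit_pose(targets, bone_lengths)
final_loss = keypoint_loss(fitted, bone_lengths, targets)
print("fitted angles:", fitted, "loss:", final_loss)
```

In a real pipeline the targets would be noisy 2D correspondences rather than exact projections, so the loss would not reach zero and some regularization of the articulation parameters would likely be needed.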