🤖 AI Summary
This work tackles end-to-end generation of animatable 3D objects with open-set articulated structures from a single input image, overcoming the high modeling cost and poor generalizability of conventional approaches. It proposes a unified implicit representation framework that jointly encodes geometry, texture, part segmentation, and kinematic parameters. A reversible joint-to-voxel embedding mechanism precisely aligns joint semantics with the voxel grid, and joint type prediction is formulated as an open-set classification task, enabling generalization to unseen joint categories and object types. A diffusion model then co-optimizes voxel-based geometry and joint semantics within a shared latent space. Evaluated on PartNet-Mobility, the method significantly outperforms multi-stage baselines, achieving state-of-the-art mesh quality and joint motion accuracy.
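The joint-to-voxel mechanism can be pictured as a pair of encode/decode maps that scatter per-joint parameters into the voxel feature grid and read them back from the same locations. The sketch below is a minimal illustration of that idea only; the tensor shapes, the `JointVoxelCodec` module, and the scatter/gather scheme are assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of a reversible joint-to-voxel embedding, per the summary
# above. Shapes, module names, and the scatter/gather scheme are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class JointVoxelCodec(nn.Module):
    """Embeds per-joint parameters into a voxel feature grid and reads them
    back from the same locations, keeping articulation features spatially
    aligned with the volumetric geometry."""

    def __init__(self, joint_dim: int = 10, feat_dim: int = 32, grid: int = 32):
        super().__init__()
        self.grid = grid
        # Paired maps trained so that decode(encode(j)) ~ j; a reconstruction
        # loss would enforce the reversibility during training.
        self.encode = nn.Linear(joint_dim, feat_dim)
        self.decode = nn.Linear(feat_dim, joint_dim)

    def joints_to_voxels(self, joints: torch.Tensor, origins: torch.Tensor) -> torch.Tensor:
        """joints: (J, joint_dim) axis/origin/type parameters per joint.
        origins: (J, 3) joint pivot positions in [0, 1]^3.
        Returns a (feat_dim, G, G, G) grid with each joint's feature written
        at the voxel containing its origin (last write wins on collisions)."""
        g = self.grid
        vol = joints.new_zeros(self.encode.out_features, g, g, g)
        idx = (origins.clamp(0, 1 - 1e-6) * g).long()        # (J, 3) voxel indices
        feats = self.encode(joints)                          # (J, feat_dim)
        vol[:, idx[:, 0], idx[:, 1], idx[:, 2]] = feats.t()
        return vol

    def voxels_to_joints(self, vol: torch.Tensor, origins: torch.Tensor) -> torch.Tensor:
        """Inverse pass: gather features at the joint voxels and decode them
        back to joint parameters."""
        idx = (origins.clamp(0, 1 - 1e-6) * self.grid).long()
        feats = vol[:, idx[:, 0], idx[:, 1], idx[:, 2]].t()  # (J, feat_dim)
        return self.decode(feats)


# Round-trip usage: scatter two joints into a grid and recover them.
codec = JointVoxelCodec()
joints = torch.randn(2, 10)
origins = torch.rand(2, 3)
vol = codec.joints_to_voxels(joints, origins)
recovered = codec.voxels_to_joints(vol, origins)
print(recovered.shape)  # torch.Size([2, 10])
```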
📝 Abstract
Articulated 3D objects play a vital role in realistic simulation and embodied robotics, yet manually constructing such assets remains costly and difficult to scale. In this paper, we present UniArt, a diffusion-based framework that directly synthesizes fully articulated 3D objects from a single image in an end-to-end manner. Unlike prior multi-stage pipelines, UniArt establishes a unified latent representation that jointly encodes geometry, texture, part segmentation, and kinematic parameters. We introduce a reversible joint-to-voxel embedding that spatially aligns articulation features with volumetric geometry, enabling the model to learn coherent motion behavior alongside shape formation. Furthermore, we formulate articulation type prediction as an open-set problem, removing the need for a fixed joint vocabulary and allowing generalization to novel joint categories and unseen object types. Experiments on the PartNet-Mobility benchmark demonstrate that UniArt achieves state-of-the-art mesh quality and articulation accuracy.
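As a rough illustration of the open-set formulation, joint-type prediction can be cast as matching a joint embedding against per-type prototype embeddings, with low similarity to every prototype flagged as a novel type rather than forced into a closed softmax. The prototype scheme, type list, and threshold below are illustrative assumptions; the paper's concrete classifier may differ.

```python
# Hedged sketch of open-set joint-type prediction: match a joint embedding
# against known-type prototypes and treat a weak best match as a novel type.
# The prototypes and threshold are stand-ins, not the paper's method.
import torch
import torch.nn.functional as F

KNOWN_TYPES = ["revolute", "prismatic", "fixed"]

def classify_joint(joint_feat: torch.Tensor,
                   prototypes: torch.Tensor,
                   threshold: float = 0.5) -> str:
    """joint_feat: (D,) embedding of one joint; prototypes: (K, D), one
    learned prototype per known type. Returns a known type name, or
    'novel' when no prototype matches confidently."""
    sims = F.cosine_similarity(joint_feat.unsqueeze(0), prototypes)  # (K,)
    best = sims.argmax().item()
    return KNOWN_TYPES[best] if sims[best] >= threshold else "novel"

# Usage with random stand-in embeddings.
protos = F.normalize(torch.randn(len(KNOWN_TYPES), 16), dim=-1)
feat = protos[0] + 0.1 * torch.randn(16)   # near the 'revolute' prototype
print(classify_joint(feat, protos))        # likely 'revolute'
```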