PAct: Part-Decomposed Single-View Articulated Object Generation

📅 2026-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing methods for generating high-fidelity articulated 3D objects struggle to simultaneously achieve structural accuracy, motion plausibility, and inference efficiency, often relying on time-consuming optimization or template matching. This work proposes a part-centric generative framework that, for the first time, explicitly incorporates part identity and motion cues into an end-to-end generation pipeline. The approach synthesizes 3D articulated objects directly from a single image—without per-instance optimization—yielding explicit part decomposition, coherent geometry, and physically plausible kinematic relationships. By leveraging a part-aware implicit token representation, the model jointly encodes geometric, compositional, and motion constraints, enabling fast feed-forward inference and controllable assembly. Evaluated on categories such as drawers and doors, the method significantly outperforms optimization- and retrieval-based baselines, demonstrating marked improvements in input consistency, part accuracy, motion plausibility, and inference speed.

Technology Category

Application Category

📝 Abstract
Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.
Problem

Research questions and friction points this paper is trying to address.

articulated object generation
part decomposition
single-view reconstruction
kinematic motion
3D asset creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

part-decomposed generation
articulated object modeling
single-view 3D reconstruction
latent token representation
feed-forward inference
🔎 Similar Papers
No similar papers found.
Q
Qingming Liu
The Chinese University of Hong Kong, Shenzhen, China and DexForce Technology, China
X
Xinyue Yao
The Chinese University of Hong Kong, Shenzhen, China
S
Shuyuan Zhang
The Chinese University of Hong Kong, Shenzhen, China
Y
Yueci Deng
The Chinese University of Hong Kong, Shenzhen, China and DexForce Technology, China
Guiliang Liu
Guiliang Liu
Chinese University of Hongkong, Shenzhen
Reinforcement LearningMachine Learning
Zhen Liu
Zhen Liu
Chinese University of Hong Kong (Shenzhen)
Machine LearningComputer Vision
K
Kui Jia
The Chinese University of Hong Kong, Shenzhen, China and DexForce Technology, China