MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reconstructing the 3D structure of articulated objects from a single image requires joint inference of geometry, part structure, and motion parameters. Because motion cues and object structure are entangled, regressing articulation directly from image features tends to be unstable. This work proposes MonoArt, a unified framework that reasons progressively about structure and motion within a single architecture, without multi-stage pipelines or external motion templates. MonoArt transforms visual observations step by step into canonical geometry, structured part representations, and motion-aware embeddings, which the authors report yields stable and interpretable articulation inference. On PartNet-Mobility the method is reported to achieve state-of-the-art reconstruction accuracy and inference speed, and it generalizes to robotic manipulation and articulated scene reconstruction.

📝 Abstract
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
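To make "motion parameters" concrete, the following is a minimal sketch of how a predicted joint could move a point on a part. It is an illustration only, not MonoArt's implementation: the `Joint` fields, the two joint kinds (revolute and prismatic), and the `articulate` helper are assumptions that do not appear in the abstract. The rotation uses the standard Rodrigues formula.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class Joint:
    """Hypothetical joint description; field names are illustrative."""
    kind: str     # "revolute" or "prismatic" (assumed taxonomy)
    axis: Vec3    # unit direction of the joint axis
    origin: Vec3  # any point on the axis; ignored for prismatic joints


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def articulate(point: Vec3, joint: Joint, state: float) -> Vec3:
    """Move one point of a part to the given joint state.

    state is an angle in radians for a revolute joint and a
    displacement along the axis for a prismatic joint.
    """
    k = joint.axis
    if joint.kind == "prismatic":
        # Pure translation along the axis.
        return tuple(p + state * ki for p, ki in zip(point, k))
    if joint.kind == "revolute":
        # Rodrigues' rotation about the line through joint.origin along k.
        v = tuple(p - o for p, o in zip(point, joint.origin))
        c, s = math.cos(state), math.sin(state)
        kxv = _cross(k, v)
        kdv = _dot(k, v)
        rotated = tuple(v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c)
                        for i in range(3))
        return tuple(r + o for r, o in zip(rotated, joint.origin))
    raise ValueError(f"unknown joint kind: {joint.kind!r}")


# Example: a door-like hinge about the z-axis and a drawer-like slider.
hinge = Joint(kind="revolute", axis=(0.0, 0.0, 1.0), origin=(1.0, 0.0, 0.0))
slider = Joint(kind="prismatic", axis=(0.0, 0.0, 1.0), origin=(0.0, 0.0, 0.0))
opened = articulate((2.0, 0.0, 0.0), hinge, math.pi / 2)
pulled = articulate((0.0, 0.0, 0.0), slider, 0.5)
```

Under this parameterization, a small error in the predicted axis or origin displaces every point of the moving part, which is one way to see why articulation estimates depend on an accurate estimate of part structure.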
Problem

Research questions and friction points this paper is trying to address.

monocular 3D reconstruction
articulated objects
structural reasoning
motion-structure entanglement
single-image inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive structural reasoning
monocular articulated 3D reconstruction
motion-aware embedding
canonical geometry
structured part representation
Haitian Li
S-Lab, Nanyang Technological University, 637335 Singapore
Haozhe Xie
Nanyang Technological University
Computer Vision, 3D Vision, Generative AI, Robotics
Junxiang Xu
S-Lab, Nanyang Technological University, 637335 Singapore
Beichen Wen
S-Lab, Nanyang Technological University, 637335 Singapore
Fangzhou Hong
Nanyang Technological University
3D Computer Vision
Ziwei Liu
Associate Professor, Nanyang Technological University
Computer Vision, Machine Learning, Computer Graphics