🤖 AI Summary
Reconstructing the 3D structure of articulated objects from a single image requires joint inference of geometry, part structure, and motion parameters; however, the entanglement between motion cues and structure makes direct articulation regression unstable. This work proposes MonoArt, a unified framework that reasons about structure and motion progressively, without multi-stage pipelines or external motion templates. Within a single architecture, MonoArt transforms visual observations in sequence into canonical geometry, structured part representations, and motion-aware embeddings, yielding stable and interpretable articulation inference. The method attains state-of-the-art reconstruction accuracy and inference speed on PartNet-Mobility and generalizes to robotic manipulation and articulated scene reconstruction.
📝 Abstract
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
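To make "motion parameters" concrete, the following is a minimal sketch of how joint parameters are commonly applied to a part's points in articulated-object work: a revolute joint (axis, pivot, angle) and a prismatic joint (axis, distance). This is an assumption for illustration only; the abstract does not specify MonoArt's parameterization, and the function names `rotate_about_axis` and `translate_along_axis` are hypothetical.

```python
import math


def rotate_about_axis(point, pivot, axis, angle):
    """Revolute joint: rotate `point` by `angle` radians about the line
    through `pivot` with direction `axis` (Rodrigues' rotation formula)."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                 # unit joint axis
    v = [p - c for p, c in zip(point, pivot)]    # point relative to pivot
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    rotated = [
        vi * cos_t + ci * sin_t + ki * k_dot_v * (1.0 - cos_t)
        for vi, ci, ki in zip(v, k_cross_v, k)
    ]
    return [r + c for r, c in zip(rotated, pivot)]


def translate_along_axis(point, axis, distance):
    """Prismatic joint: slide `point` by `distance` along the unit `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    return [p + distance * a / norm for p, a in zip(point, axis)]


# Hypothetical usage: a door corner swinging 90 degrees about a vertical hinge,
# and a drawer corner sliding half a unit along z.
door_corner = rotate_about_axis([2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], math.pi / 2)
drawer_corner = translate_along_axis([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], 0.5)
```

Under this parameterization, an error in the predicted axis or pivot moves every point of the part, which gives some intuition for why regressing articulation directly from image features can be unstable.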