🤖 AI Summary
This work addresses the challenging problem of 3D articulated object reconstruction from sparse multi-view images. We propose the first end-to-end feed-forward framework that jointly recovers high-fidelity geometry, photorealistic texture, and physically plausible joint motion structure. Our method extends LVSM with a Transformer-based architecture to jointly infer camera poses and articulated motion, without requiring dense viewpoints or per-instance optimization. By co-modeling novel view synthesis, depth maps, and part masks in a unified, differentiable pipeline, it enables explicit 3D mesh reconstruction. Compared to existing optimization-based and feed-forward approaches, our method achieves significant improvements in geometric accuracy, texture fidelity, and kinematic plausibility. It establishes new state-of-the-art performance on both novel view synthesis and articulated object reconstruction, offering high reconstruction fidelity and strong scalability.
📄 Abstract
Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, into the articulated setting by jointly reasoning over camera pose and articulation variation with a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis and in 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Project page: https://sylviayuan-sy.github.io/larm-site/
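At a high level, the abstract describes a single feed-forward pass: sparse input views plus a target camera pose and articulation state go in, and a rendered novel view together with auxiliary depth maps and part masks come out. The sketch below is purely illustrative of that interface, not the actual implementation; all names (`larm_forward`, `LARMOutputs`) and the stubbed outputs are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LARMOutputs:
    """Hypothetical container mirroring the outputs named in the abstract."""
    novel_views: List[str]  # RGB renderings at the requested target poses
    depth_maps: List[str]   # per-view depth, used downstream for mesh extraction
    part_masks: List[str]   # per-view part segmentation, used for joint estimation

def larm_forward(sparse_views: List[str],
                 target_poses: List[str],
                 articulation_state: float) -> LARMOutputs:
    """Illustrative stub of the feed-forward call: one pass conditioned
    jointly on camera pose and articulation variation (no per-instance
    optimization). The real model is a transformer; here we only echo
    the requested poses and state as placeholder outputs."""
    tag = f"state={articulation_state}"
    return LARMOutputs(
        novel_views=[f"rgb@{p}|{tag}" for p in target_poses],
        depth_maps=[f"depth@{p}|{tag}" for p in target_poses],
        part_masks=[f"mask@{p}|{tag}" for p in target_poses],
    )

# One call yields aligned views, depths, and masks for every target pose.
out = larm_forward(["view0.png", "view1.png"], ["pose_a", "pose_b"], 0.5)
```

The key design point the abstract emphasizes is that all three output modalities come from the same pass, so the depth and mask predictions stay consistent with the synthesized views used for mesh extraction.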