🤖 AI Summary
This work addresses the challenging problem of recognizing and localizing part-level material compositions in 3D objects. To this end, we introduce Grounded CoMPaT Recognition (GCR), a novel task requiring joint recognition and spatial grounding of material compositions at the object-part level. We propose 3DCoMPaT++, a large-scale multimodal 3D dataset comprising 160 million rendered images, 10 million stylized 3D models, and fine-grained joint annotations of parts, materials, and semantics, spanning 293 fine-grained material classes. Methodologically, we design a Blender-based controllable rendering pipeline and a modified PointNet++ adapted to 6D point-cloud inputs, and we combine multi-view sampling with part-instance-level annotations that decouple materials from parts. Our approach achieved first place in the CVPR 2023 data challenge, significantly improving both material localization accuracy and composition recognition performance. 3DCoMPaT++ has since emerged as a key benchmark for compositional 3D vision research.
📝 Abstract
In this work, we present 3DCoMPaT$^{++}$, a multimodal 2D/3D dataset with 160 million rendered views of more than 10 million stylized 3D shapes carefully annotated at the part-instance level, alongside matching RGB point clouds, 3D textured meshes, depth maps, and segmentation masks. 3DCoMPaT$^{++}$ covers 41 shape categories, 275 fine-grained part categories, and 293 fine-grained material classes that can be compositionally applied to parts of 3D objects. We render a subset of one million stylized shapes from four equally spaced views as well as four randomized views, leading to a total of 160 million renderings. Parts are segmented at the instance level, with coarse-grained and fine-grained semantic levels. We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects. Additionally, we report the outcomes of a data challenge organized at CVPR 2023, showcasing the winning method, which uses a modified PointNet$^{++}$ model trained on 6D inputs, and exploring alternative techniques for improving GCR performance. We hope our work will help ease future research on compositional 3D vision.
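To make the "6D inputs" mentioned above concrete: each point in an RGB point cloud carries XYZ coordinates plus an RGB color, giving a 6-dimensional feature per point. The following is a minimal illustrative sketch (not the authors' pipeline; the function name and preprocessing choices are assumptions) of how such a 6D cloud could be assembled before being fed to a PointNet$^{++}$-style model:

```python
import numpy as np

def make_6d_point_cloud(xyz: np.ndarray, rgb: np.ndarray) -> np.ndarray:
    """Concatenate (N, 3) coordinates and (N, 3) colors into an (N, 6) cloud.

    Illustrative helper, not from the 3DCoMPaT++ codebase.
    """
    assert xyz.shape == rgb.shape and xyz.shape[1] == 3
    # Center and scale coordinates to the unit sphere, a common
    # normalization step for point-cloud networks (assumed, not specified
    # in the abstract).
    xyz = xyz - xyz.mean(axis=0)
    scale = np.linalg.norm(xyz, axis=1).max()
    if scale > 0:
        xyz = xyz / scale
    # Map 8-bit colors to [0, 1] if they are not already normalized.
    if rgb.max() > 1.0:
        rgb = rgb / 255.0
    return np.concatenate([xyz, rgb], axis=1)  # shape (N, 6)

# Example: a random cloud of 2048 colored points.
xyz = np.random.rand(2048, 3)
rgb = np.random.randint(0, 256, (2048, 3)).astype(float)
pts = make_6d_point_cloud(xyz, rgb)
print(pts.shape)  # (2048, 6)
```

A model "trained on 6D inputs" then simply consumes this `(N, 6)` array instead of the bare `(N, 3)` coordinates, letting color cues inform both part segmentation and material prediction.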