AI Summary
This work addresses the challenge that existing open-world, promptable 3D semantic part segmentation methods operate in sensor coordinates, which hinders their ability to model the stable semantics of object functional roles. To overcome this limitation, we propose a learning framework built on an implicit canonical reference frame: a dual-branch architecture performs canonical map anchoring and canonical box calibration, transferring perception from the input pose space to a unified canonical space. We introduce a large language model-guided cross-category canonical alignment mechanism, construct a unified canonical dataset spanning 200 categories, and learn pose-invariant canonical embeddings within the model. Our approach achieves state-of-the-art performance on open-world promptable 3D part segmentation, significantly improving segmentation stability and cross-category generalization.
Abstract
Open-world promptable 3D semantic segmentation remains brittle because semantics are inferred in the input sensor coordinates. Humans, in contrast, interpret parts via their functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To close this gap, we propose \methodName{}, which attains canonical-space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from the input pose space to a canonical embedding yields far more stable and transferable part semantics. Experimental results show that \methodName{} establishes a new state of the art in open-world promptable 3D segmentation.
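To make the dual-branch idea concrete, below is a minimal, hypothetical PyTorch sketch (not the paper's released code) of how input-pose points could be mapped into a learned canonical frame before embedding: one branch anchors a canonical rotation, the other calibrates a canonical box (center and extent) used to normalize the points. All names (`CanonicalDualBranchHead`, `rotation_from_6d`) and design details are illustrative assumptions.

```python
# Hypothetical sketch of canonical map anchoring + canonical box calibration.
# Not the authors' implementation; assumptions are noted inline.
import torch
import torch.nn as nn


def rotation_from_6d(x: torch.Tensor) -> torch.Tensor:
    """Build a rotation matrix from a 6D representation via Gram-Schmidt."""
    a1, a2 = x[..., :3], x[..., 3:]
    b1 = nn.functional.normalize(a1, dim=-1)
    b2 = nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)  # (..., 3, 3)


class CanonicalDualBranchHead(nn.Module):
    """Toy dual-branch head: one branch predicts a canonical rotation
    (map anchoring), the other a canonical box (box calibration); points
    are transformed into that frame before per-point embedding."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Assumed lightweight point encoder; the real backbone is unspecified here.
        self.encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.rot_branch = nn.Linear(feat_dim, 6)   # canonical map anchoring
        self.box_branch = nn.Linear(feat_dim, 6)   # canonical box: center (3) + log-extent (3)
        self.embed = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: (B, N, 3) points in sensor/input coordinates
        global_feat = self.encoder(pts).max(dim=1).values           # (B, feat_dim)
        R = rotation_from_6d(self.rot_branch(global_feat))          # (B, 3, 3)
        box = self.box_branch(global_feat)
        center, extent = box[:, :3], box[:, 3:].exp()
        # Rotate into the canonical frame, then normalize by the canonical box.
        canon = torch.einsum('bij,bnj->bni', R, pts - center[:, None])
        canon = canon / extent[:, None].clamp(min=1e-4)
        return self.embed(canon)                                     # pose-invariant per-point embeddings


if __name__ == "__main__":
    head = CanonicalDualBranchHead()
    pts = torch.randn(2, 1024, 3)
    print(head(pts).shape)  # torch.Size([2, 1024, 128])
```

In this sketch, pose variation is collapsed because the embedding MLP only ever sees box-normalized, canonically rotated coordinates; how the actual method supervises the two branches (e.g., with the LLM-aligned canonical dataset) is beyond what this toy example shows.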