🤖 AI Summary
This work addresses the limited ability of existing category-level object pose estimation methods to generalize to unseen instances by proposing the TSM-Pose framework. It introduces the Mamba architecture to this task for the first time, featuring a topology extractor that captures the global topological structure of point clouds and integrates it with local geometric features. The framework also includes a Mamba-based global semantic aggregator, built from TwinMamba blocks, that injects semantic priors into keypoints and models long-range dependencies, thereby enhancing the representation of semantic keypoints. Extensive experiments on three benchmarks (REAL275, CAMERA25, and HouseCat6D) show that TSM-Pose consistently outperforms state-of-the-art methods, validating its effectiveness and robustness.
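The summary does not say how the topology extractor computes its global representation or how it is merged with local geometry. Purely as an illustration of the generic global-local fusion pattern, the sketch below uses a k-nearest-neighbour centroid offset as a stand-in local geometric feature, pools all per-point features into one order-invariant global descriptor, and appends that descriptor to every point. The functions `knn`, `local_geometry`, `global_descriptor`, and `fuse` and the parameter `k` are hypothetical; this is not the TSM-Pose Topology Extractor.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def knn(points: Sequence[Point], i: int, k: int) -> List[int]:
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]


def local_geometry(points: Sequence[Point], k: int) -> List[List[float]]:
    """Hypothetical local feature: offset from each point to its neighbourhood centroid."""
    feats = []
    for i, p in enumerate(points):
        nbrs = knn(points, i, k)
        centroid = [sum(points[j][d] for j in nbrs) / len(nbrs) for d in range(3)]
        feats.append([centroid[d] - p[d] for d in range(3)])
    return feats


def global_descriptor(feats: List[List[float]]) -> List[float]:
    """Order-invariant global summary: channel-wise max over all points."""
    return [max(f[c] for f in feats) for c in range(len(feats[0]))]


def fuse(points: Sequence[Point], k: int = 2) -> List[List[float]]:
    """Concatenate each point's local feature with the shared global descriptor."""
    local = local_geometry(points, k)
    g = global_descriptor(local)
    return [f + g for f in local]


if __name__ == "__main__":
    cloud = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
    fused = fuse(cloud, k=2)
    # 4 points, each with 3 local channels followed by 3 global channels
    print(len(fused), len(fused[0]))
```

Because the global descriptor is a max over points, it does not depend on point ordering, which is the property any point-cloud-level summary needs; a learned extractor would replace both hand-written features.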
📝 Abstract
Category-level object pose estimation is fundamental to embodied intelligence, yet robust generalization to unseen instances remains challenging. Existing methods mainly rely on simple feature extraction and aggregation, which struggle to capture category-shared topological structures and to model semantic keypoints, limiting their generalization. To address these limitations, we propose TSM-Pose, a framework of Topology-Aware Learning with Semantic Mamba for Category-Level Pose Estimation. Specifically, we introduce a Topology Extractor that captures a global topological representation of the point cloud and fuses it with local geometric features, yielding a robust category-level structural representation. In parallel, we propose a Mamba-based Global Semantic Aggregator that injects semantic priors into keypoints to enhance their expressiveness and uses multiple TwinMamba blocks to model long-range dependencies for more effective global feature aggregation. Extensive experiments on three benchmark datasets (REAL275, CAMERA25, and HouseCat6D) demonstrate that TSM-Pose outperforms existing state-of-the-art methods.
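The abstract does not describe the internals of the Global Semantic Aggregator or the TwinMamba block. As background only, the sketch below shows the selective state-space recurrence that Mamba layers are built on, reduced to one input channel in pure Python: the step size Δ and the matrices B and C are computed from the current input, and a recurrent state carries information along the whole token (for example, keypoint) sequence in linear time. The scalar weights `w_delta`, `w_B`, and `w_C` are simplifying assumptions; real Mamba layers use learned multi-channel projections and a hardware-aware parallel scan, and this is not TSM-Pose code.

```python
import math
from typing import List, Sequence


def softplus(z: float) -> float:
    """Numerically stable softplus, keeps the step size strictly positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs: Sequence[float], A: Sequence[float], w_delta: float,
                   w_B: Sequence[float], w_C: Sequence[float]) -> List[float]:
    """Single-channel selective SSM scan with a diagonal state matrix.

    xs:       input sequence (one channel, e.g. one feature of each keypoint token)
    A:        diagonal of the continuous-time state matrix (negative entries decay)
    w_delta:  hypothetical weight making the step size depend on the input
    w_B, w_C: hypothetical weights making B and C depend on the input
    """
    n = len(A)
    h = [0.0] * n
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)     # input-dependent step size
        B = [wb * x for wb in w_B]        # input-dependent input matrix
        C = [wc * x for wc in w_C]        # input-dependent output matrix
        for i in range(n):
            # Zero-order hold for A; simplified Euler step delta * B for B.
            h[i] = math.exp(delta * A[i]) * h[i] + delta * B[i] * x
        ys.append(sum(C[i] * h[i] for i in range(n)))
    return ys
```

Each output depends only on the current and earlier inputs, and entries of `A` close to zero let the state retain information over long sequences; a bidirectional or stacked arrangement of such scans is one common way to aggregate global context, though how TwinMamba does so is not stated here.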