🤖 AI Summary
This work addresses the challenge of estimating the 6D pose of unseen objects with known CAD models from RGB images, without fine-tuning at test time. To this end, we propose MixRI, a lightweight network that directly establishes correspondences between a query image and a small set of reference images through a multi-view reference feature fusion strategy. Our approach drastically reduces the number of required reference images while maintaining high pose estimation accuracy, thereby lowering both storage and computational overhead. Evaluated on the seven core datasets of the BOP Challenge, the method achieves performance comparable to state-of-the-art approaches while using fewer reference views, a smaller model footprint, less memory, and less inference time.
📝 Abstract
We present MixRI, a lightweight network that solves the CAD-based novel object pose estimation problem in RGB images. It can be applied instantly to a novel object at test time without fine-tuning. We design our network to meet the demands of real-world applications, emphasizing reduced memory requirements and fast inference. Unlike existing works that rely on many reference images and networks with large numbers of parameters, we directly match points between the query and reference images based on multi-view information, using a lightweight network. Thanks to our reference image fusion strategy, we significantly decrease the number of reference images, and thus both the time needed to process them and the memory required to store them. Furthermore, our lightweight network requires less inference time. Despite using fewer reference images, experiments on the seven core datasets of the BOP challenge show that our method achieves results comparable to those of methods that require more reference images and larger networks.
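To make the fusion-then-match idea concrete, here is a minimal sketch of the pipeline the abstract describes. It is not the paper's actual network: the learned multi-view fusion is replaced by a simple mean over per-view descriptors, and matching is plain cosine similarity; the array shapes and function names are assumptions for illustration.

```python
import numpy as np

def fuse_reference_features(ref_feats: np.ndarray) -> np.ndarray:
    """Fuse per-point features across V reference views into one descriptor
    per object point. Mean fusion is a hypothetical stand-in for the paper's
    learned multi-view reference feature fusion.

    ref_feats: (V, P, D) array -- V views, P object points, D feature dims.
    Returns: (P, D) fused, L2-normalized descriptors.
    """
    fused = ref_feats.mean(axis=0)  # average each point's feature over views
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)

def match_query_to_references(query_feats: np.ndarray,
                              fused_ref_feats: np.ndarray):
    """Match each query feature to its most similar fused reference point.

    query_feats: (Q, D) L2-normalized; fused_ref_feats: (P, D) L2-normalized.
    Returns: (Q,) best reference-point index, (Q,) cosine similarity score.
    """
    sims = query_feats @ fused_ref_feats.T          # (Q, P) cosine similarities
    best = sims.argmax(axis=1)
    return best, sims[np.arange(len(best)), best]

# Toy example: 4 reference views, 5 object points, 8-dim features.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 5, 8))
fused = fuse_reference_features(ref)

# A query feature near point 2's fused descriptor should match point 2.
query = fused[2:3] + 0.01 * rng.normal(size=(1, 8))
query /= np.linalg.norm(query, axis=1, keepdims=True)
idx, score = match_query_to_references(query, fused)
print(idx[0])
```

Because all V views are collapsed into one descriptor per point before matching, only the fused (P, D) array needs to be stored and compared at test time, which is the source of the memory and inference-time savings the abstract claims.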