🤖 AI Summary
This work introduces state invariance as an additional invariance for learning object representations for recognition and retrieval: robustness to changes in an object's structural form, such as an umbrella being folded or an item of clothing tossed on the floor. The authors present ObjectsWithStateChange, a new fine-grained dataset capturing state and pose variations of objects imaged from arbitrary viewpoints. Because instances of different objects under various state changes can share similar visual characteristics and lie close together in the learned embedding space, the authors propose a curriculum learning strategy that progressively samples harder-to-distinguish object pairs, i.e., pairs with smaller inter-object distances in the embedding space, during training. Ablations show that curriculum learning improves object recognition accuracy by 7.9% and retrieval mean average precision (mAP) by 9.2% over the state of the art on the new dataset as well as on three other challenging multi-view benchmarks: ModelNet40, ObjectPI, and FG3D.
📝 Abstract
We add one more invariance, state invariance, to the invariances commonly used for learning object representations for recognition and retrieval. By state invariance, we mean robustness with respect to changes in the structural form of an object, such as when an umbrella is folded or when an item of clothing is tossed on the floor. In this work, we present a novel dataset, ObjectsWithStateChange, which captures state and pose variations in object images recorded from arbitrary viewpoints. We believe this dataset will facilitate research in fine-grained recognition and retrieval of 3D objects that are capable of state changes. The goal of such research is to train models that learn discriminative object embeddings invariant to state changes while also remaining invariant to transformations induced by changes in viewpoint, pose, illumination, etc. A major challenge in this regard is that instances of different objects, both within and across categories, may share similar visual characteristics under various state changes and therefore lie close to one another in the learned embedding space, making them more difficult to discriminate. To address this, we propose a curriculum learning strategy that progressively selects object pairs with smaller inter-object distances in the learned embedding space during training, gradually sampling harder-to-distinguish examples of visually similar objects both within and across categories. Our ablation on the role played by curriculum learning indicates an improvement in object recognition accuracy of 7.9% and retrieval mAP of 9.2% over the state of the art on our new dataset, as well as on three other challenging multi-view datasets: ModelNet40, ObjectPI, and FG3D.
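The curriculum idea above can be sketched in a few lines: at each epoch, sample pairs of different objects whose embedding distance falls below a threshold that tightens as training progresses, so that easy (well-separated) pairs dominate early and hard (nearby) pairs dominate later. This is a minimal illustrative sketch, not the paper's implementation; the function name, the linear quantile schedule, and the thresholding rule are all assumptions.

```python
import numpy as np

def curriculum_pairs(embeddings, labels, epoch, total_epochs):
    """Select negative pairs (different objects) whose embedding distance
    is below a threshold that shrinks over epochs (hypothetical schedule)."""
    # Pairwise Euclidean distances between all embeddings, shape (N, N).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Consider only pairs of *different* objects.
    neg_mask = labels[:, None] != labels[None, :]
    neg_dists = dist[neg_mask]
    # Curriculum: quantile threshold shrinks linearly from 1.0 (all pairs)
    # to 0.2 (hardest 20% of pairs) -- an assumed schedule for illustration.
    frac = 1.0 - 0.8 * (epoch / max(total_epochs - 1, 1))
    thresh = np.quantile(neg_dists, frac)
    i, j = np.where(neg_mask & (dist <= thresh))
    keep = i < j  # report each unordered pair once
    return list(zip(i[keep], j[keep]))
```

Early in training the threshold admits all cross-object pairs; as it tightens, only the visually similar (small-distance) pairs remain, which is the "progressively harder negatives" behavior described in the abstract.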