🤖 AI Summary
Existing image–text matching methods struggle to balance performance and efficiency and are often susceptible to interference from irrelevant fragments. This work proposes the OMIT network, which for the first time integrates optimal partial transport with Sinkhorn iterations, computing fine-grained similarity between cross-modal fragments via a Cross-modal Mover's Distance. By introducing a partial matching mechanism that discards redundant alignments, OMIT achieves focused matching from local to global levels. The method effectively suppresses interference from irrelevant fragments and attains state-of-the-art performance on both the Flickr30K and MS-COCO benchmarks. Visualization results further corroborate its precise alignment capability.
📝 Abstract
Cross-modal matching, a fundamental task in bridging vision and language, has recently garnered substantial research interest. Despite the development of numerous methods aimed at quantifying the semantic relatedness between image-text pairs, these methods often fall short of achieving both outstanding performance and high efficiency. In this paper, we propose the crOss-Modal sInkhorn maTching (OMIT) network as a solution that effectively improves performance while maintaining efficiency. Rooted in the theoretical foundations of Optimal Transport, OMIT harnesses the Cross-modal Mover's Distance to precisely compute the similarity between fine-grained visual and textual fragments, utilizing Sinkhorn iterations for efficient approximation. To further alleviate the issue of redundant alignments, we seamlessly integrate partial matching into OMIT, leveraging local-to-global similarities to eliminate the interference of irrelevant fragments. We conduct extensive evaluations of OMIT on two benchmark image-text retrieval datasets, namely Flickr30K and MS-COCO. The superior performance achieved by OMIT on both datasets demonstrates its effectiveness in cross-modal matching. Furthermore, through comprehensive visualization analysis, we elucidate OMIT's inherent tendency towards focal matching, thereby shedding light on its efficacy. Our code is publicly available at https://github.com/ppanzx/OMIT.
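To make the core mechanism concrete: entropic optimal transport approximated by Sinkhorn iterations alternately rescales the rows and columns of a Gibbs kernel until the transport plan's marginals match prescribed fragment weights. The sketch below is a generic, minimal illustration of this standard procedure, not the authors' implementation; all names (`sinkhorn`, `cost`, `a`, `b`) and hyperparameters (`eps`, `n_iters`) are assumptions for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=300):
    """Generic entropic-OT sketch (not the OMIT code).

    cost : (m, n) pairwise cost between m visual and n textual fragments
    a, b : marginal weights for the two fragment sets (each sums to 1)
    eps  : entropic regularization strength (assumed hyperparameter)
    Returns the (m, n) transport plan P.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # column scaling
        u = a / (K @ v)              # row scaling
    P = u[:, None] * K * v[None, :]  # transport plan with marginals ~(a, b)
    return P

# A matching score can then be read off the plan, e.g. as the negative
# transported cost: higher means the fragment sets align more cheaply.
def ot_similarity(cost, a, b):
    P = sinkhorn(cost, a, b)
    return -np.sum(P * cost)
```

Partial matching, as described in the abstract, would relax the marginal constraints so that low-relevance fragments receive little or no mass; the full-matching sketch above is only the balanced-transport starting point.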