Descriptive Image-Text Matching with Graded Contextual Similarity

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image–text matching methods rely on sparse binary supervision, limiting their ability to model the many-to-many and hierarchical semantic relationships—from general to specific—between images and texts. This work proposes a novel descriptive image–text matching paradigm. It introduces, for the first time, cumulative TF-IDF to quantify sentence-level descriptive granularity, thereby constructing an ordered, general-to-specific matching structure. It further designs a dynamic similarity-weighted loss and a hierarchical alignment training strategy to jointly model multi-granularity contextual similarities, alleviating the limited semantic coverage of binary supervision. Extensive experiments demonstrate state-of-the-art performance on MS-COCO, Flickr30K, and CxC, and evaluation on the HierarCaps benchmark confirms its superior capability for hierarchical cross-modal reasoning.

📝 Abstract
Image-text matching aims to build correspondences between visual and textual data by learning their pairwise similarities. Most existing approaches have adopted sparse binary supervision, indicating only whether an image-sentence pair matches or not. However, such sparse supervision covers a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences; an image can be described by numerous texts at different descriptive levels. Moreover, existing approaches overlook the implicit connections from general to specific descriptions, which form the underlying rationale for the many-to-many relationships between vision and language. In this work, we propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text by exploring the descriptive flexibility of language. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity according to the keywords in the sentence. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways: (1) refining false negative labeling by dynamically relaxing the connectivity between positive and negative pairs, and (2) building more precise matching by aligning a set of relevant sentences in a generic-to-specific order. By moving beyond rigid binary supervision, DITM enhances the discovery of both optimal matches and potential positive pairs. Extensive experiments on the MS-COCO, Flickr30K, and CxC datasets demonstrate the effectiveness of our method in representing complex image-text relationships compared to state-of-the-art approaches. In addition, DITM enhances the hierarchical reasoning ability of the model, supported by extensive analysis on the HierarCaps benchmark.
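The abstract's core scoring idea—ranking captions by cumulative TF-IDF so that more specific descriptions score higher than generic ones—can be sketched as follows. This is a minimal illustration, not the paper's implementation: it treats each caption as its own document and sums per-word TF-IDF weights; the function name and tokenization are assumptions for the sketch.

```python
import math
from collections import Counter

def descriptiveness_scores(sentences):
    """Hypothetical sketch of a cumulative TF-IDF descriptiveness score.

    Each sentence is treated as one document; a sentence's score is the
    sum of TF-IDF weights over its words, so captions with more rare,
    specific keywords score higher than short generic ones.
    """
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: how many sentences contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Cumulative TF-IDF over the sentence's words.
        scores.append(sum(c * math.log(n / df[w]) for w, c in tf.items()))
    return scores
```

Under this toy measure, a set of captions ordered from generic to specific (e.g. "a dog" → "a brown dog runs across the grassy park") receives monotonically increasing scores, which is the ordering the paper's general-to-specific matching structure relies on.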
Problem

Research questions and friction points this paper is trying to address.

Learning graded contextual similarity between images and texts
Addressing many-to-many correspondences in image-text relationships
Enhancing hierarchical reasoning in image-text matching models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses graded contextual similarity for image-text matching
Formulates descriptiveness score with cumulative TF-IDF
Dynamically refines false negative labeling and aligns sentences