Descriptive Image-Text Matching with Graded Contextual Similarity

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image–text matching methods rely on sparse binary supervision, limiting their ability to model the many-to-many and hierarchical semantic relationships—from general to specific—between images and texts. This work proposes a novel descriptive image–text matching paradigm. It introduces, for the first time, cumulative TF-IDF to quantify sentence-level descriptive granularity, thereby constructing an ordered, general-to-specific matching structure. It further designs a dynamic similarity-weighted loss and a hierarchical alignment training strategy to jointly model multi-granularity contextual similarities, alleviating the limited semantic coverage of binary supervision. Extensive experiments demonstrate state-of-the-art performance on MS-COCO, Flickr30K, and CxC, and evaluation on the HierarCaps benchmark confirms its superior capability for hierarchical cross-modal reasoning.

📝 Abstract
Image-text matching aims to build correspondences between visual and textual data by learning their pairwise similarities. Most existing approaches have adopted sparse binary supervision, indicating only whether an image-sentence pair matches or not. However, such sparse supervision covers a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences; an image can be described by numerous texts at different descriptive levels. Moreover, existing approaches overlook the implicit connections from general to specific descriptions, which form the underlying rationale for the many-to-many relationships between vision and language. In this work, we propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text by exploring the descriptive flexibility of language. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity according to the keywords in the sentence. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways: (1) refining false negative labeling by dynamically relaxing the connectivity between positive and negative pairs, and (2) building more precise matching by aligning a set of relevant sentences in a generic-to-specific order. By moving beyond rigid binary supervision, DITM enhances the discovery of both optimal matches and potential positive pairs. Extensive experiments on the MS-COCO, Flickr30K, and CxC datasets demonstrate the effectiveness of our method in representing complex image-text relationships compared to state-of-the-art approaches. In addition, DITM enhances the hierarchical reasoning ability of the model, supported by extensive analysis on the HierarCaps benchmark.
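The abstract's core scoring idea—ranking captions by cumulative TF-IDF so that more specific descriptions score higher than generic ones—can be sketched as follows. This is a minimal illustration, not the paper's implementation: it treats each caption as its own document and sums per-word TF-IDF weights; the function name and tokenization are assumptions for the sketch.

```python
import math
from collections import Counter

def descriptiveness_scores(sentences):
    """Hypothetical sketch of a cumulative TF-IDF descriptiveness score.

    Each sentence is treated as one document; a sentence's score is the
    sum of TF-IDF weights over its words, so captions with more rare,
    specific keywords score higher than short generic ones.
    """
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: how many sentences contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Cumulative TF-IDF over the sentence's words.
        scores.append(sum(c * math.log(n / df[w]) for w, c in tf.items()))
    return scores
```

Under this toy measure, a set of captions ordered from generic to specific (e.g. "a dog" → "a brown dog runs across the grassy park") receives monotonically increasing scores, which is the ordering the paper's general-to-specific matching structure relies on.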
Problem

Research questions and friction points this paper is trying to address.

Learning graded contextual similarity between images and texts
Addressing many-to-many correspondences in image-text relationships
Enhancing hierarchical reasoning in image-text matching models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses graded contextual similarity for image-text matching
Formulates descriptiveness score with cumulative TF-IDF
Dynamically refines false negative labeling and aligns sentences