FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

📅 2026-05-26
🤖 AI Summary
This work addresses a limitation of existing vision-language models such as CLIP, which struggle with fine-grained image-text alignment on long textual descriptions because they are pretrained on short captions. To overcome this, the authors propose a global-local semantic alignment mechanism that uses object detection and spatial partitioning to extract image regions and match them with corresponding sentences, then strengthens fine-grained correspondences through token-level similarity learning between patch tokens and region embeddings (with the same principle applied to text). They introduce two efficient components, FLISM and TSL, and construct GLIT100k, a dataset that pairs global image-caption data with local pairs whose descriptions are extracted from the global captions to keep them semantically coherent. Experiments show significant gains over baselines on both long-caption benchmarks (DOCCI, DCI) and standard short-caption datasets (MSCOCO, Flickr30k), improving CLIP's ability to handle detailed textual descriptions.
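The spatial partitioning step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `grid_regions`, the uniform grid layout, and the default 2x2 split are assumptions made for illustration only, and the paper may combine detector boxes with a different division scheme.

```python
def grid_regions(width, height, rows=2, cols=2):
    """Split a width x height image into rows * cols boxes.

    Hypothetical helper: a uniform grid is an assumption, not the paper's
    documented scheme. Boxes are (left, top, right, bottom) in pixels,
    listed in row-major order.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            # Integer division keeps the boxes adjacent and non-overlapping.
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    for box in grid_regions(224, 224):
        print(box)
```

Each resulting box could then be cropped and encoded as a local region alongside any boxes returned by an object detector.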
📝 Abstract
Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because they are pre-trained on short, concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances CLIP's ability to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between the patch tokens from a specific image region and the corresponding region embedding, and applies the same principle to text, which enhances the model's ability to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global pairs of images and lengthy captions and context-derived local pairs, where local descriptions are extracted from the global captions to maintain semantic coherence. Through extensive experiments on long-caption datasets (DOCCI, DCI) and short-caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.
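The idea behind TSL, pulling the patch tokens of a region toward that region's embedding, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's loss: selecting patches by their centers, mean pooling, and the `1 - cosine` objective are all choices made here, and the helper names (`patches_in_box`, `token_similarity_loss`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patches_in_box(box, grid, patch):
    """Row-major indices of patches whose centers fall inside the box.

    Assumes a square grid of `grid` x `grid` patches, each `patch` pixels wide.
    """
    left, top, right, bottom = box
    indices = []
    for r in range(grid):
        for c in range(grid):
            cx = (c + 0.5) * patch
            cy = (r + 0.5) * patch
            if left <= cx < right and top <= cy < bottom:
                indices.append(r * grid + c)
    return indices


def token_similarity_loss(patch_tokens, region_embedding, box, grid, patch):
    """Assumed objective: 1 - cos(mean of in-box patch tokens, region embedding)."""
    indices = patches_in_box(box, grid, patch)
    if not indices:
        raise ValueError("box does not cover any patch center")
    dim = len(region_embedding)
    # Mean-pool the patch tokens that belong to the region.
    pooled = [sum(patch_tokens[i][d] for i in indices) / len(indices)
              for d in range(dim)]
    return 1.0 - cosine(pooled, region_embedding)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    print(token_similarity_loss(tokens, [2.0, 0.0], (0, 0, 16, 16), grid=2, patch=16))
```

Minimizing this quantity raises the similarity between the pooled patch tokens and the region embedding. Per the abstract, the same principle would apply on the text side, with sentence tokens taking the place of patch tokens.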
Problem

Research questions and friction points this paper is trying to address.

vision-language alignment
long text descriptions
detailed caption understanding
CLIP limitations
global-local semantic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

global-local alignment
vision-language model
efficient fine-tuning
token similarity learning
object-level matching