FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of vision-language models such as CLIP: because they are pre-trained on short captions, they struggle to align images with long, detailed text descriptions. The authors propose FAST-GOAL, an efficient fine-tuning method built on global-local semantic alignment, with two components. Fast Local Image-Sentence Matching (FLISM) extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from a region and that region's embedding, and applies the same principle to text. The authors also introduce GLIT100k, a dataset that pairs images with long global captions and adds local pairs whose descriptions are extracted from the global captions to keep them semantically coherent. Experiments on long-caption benchmarks (DOCCI, DCI) and short-caption datasets (MSCOCO, Flickr30k) show significant improvements over baselines, improving CLIP's handling of detailed textual descriptions.
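
The region-extraction step described above combines object detection with a spatial split of the image. As a minimal sketch of the spatial part only, the function below cuts an image into a regular grid of boxes. The function name `grid_regions`, the regular-grid layout, and the `(x0, y0, x1, y1)` box format are assumptions for illustration; the summary does not say how the paper partitions images, and the object detector is left out.

```python
def grid_regions(width, height, rows, cols):
    """Split a width x height image into rows x cols boxes.

    Each box is (x0, y0, x1, y1) with end-exclusive right/bottom edges.
    Integer boundaries keep the boxes gap-free even when the image size
    is not divisible by the grid size. (Hypothetical helper, not from the paper.)
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            y0, y1 = r * height // rows, (r + 1) * height // rows
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Each box could then be cropped and paired with a sentence from the long caption; how that pairing is chosen is not specified in this summary.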

📝 Abstract

Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-training on short and concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances the ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, applying the same principle to text, which enhances the ability of the model to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global pairs of images and lengthy captions and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence. Through extensive experiments on long caption datasets (DOCCI, DCI) and short caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.
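
The TSL idea of pulling a region's patch tokens toward that region's embedding can be illustrated with a small sketch. The abstract does not give the loss, so everything concrete here is an assumption: mean pooling over the region's patch tokens, cosine similarity, a loss of `1 - cosine`, and the hypothetical names `region_patch_indices` and `token_similarity_loss`. Patch tokens are plain lists laid out row by row on a patch grid of width `grid_w`.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def region_patch_indices(box, grid_w):
    """Flat indices of the patches covered by box = (col0, row0, col1, row1).

    Coordinates are in patch units and end-exclusive.
    """
    c0, r0, c1, r1 = box
    return [r * grid_w + c for r in range(r0, r1) for c in range(c0, c1)]


def token_similarity_loss(patch_tokens, box, grid_w, region_embedding):
    """Assumed loss: 1 - cosine(mean of region patch tokens, region embedding)."""
    idx = region_patch_indices(box, grid_w)
    pooled = mean_pool([patch_tokens[i] for i in idx])
    return 1.0 - cosine(pooled, region_embedding)


# Toy 2x2 patch grid with 2-dimensional tokens; the left column points along x.
tokens = [[1.0, 0.0], [0.0, 1.0],
          [1.0, 0.0], [0.0, 1.0]]
left_column = (0, 0, 1, 2)
loss_aligned = token_similarity_loss(tokens, left_column, 2, [2.0, 0.0])
loss_orthogonal = token_similarity_loss(tokens, left_column, 2, [0.0, 1.0])
```

In this toy case the loss is near zero when the region embedding points the same way as the pooled tokens and near one when it is orthogonal. Per the abstract, the same principle is applied on the text side, with text tokens taking the place of patch tokens.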

Problem

Research questions and friction points this paper is trying to address.

vision-language alignment

long text descriptions

detailed caption understanding

CLIP limitations

global-local semantic alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

global-local alignment

vision-language model

efficient fine-tuning