Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation of zero-shot performance in vision-language models caused by redundant image patches and textual descriptions in fine-grained image-text alignment. To mitigate this issue, the authors propose BiFTA, a novel method that introduces a dual-path refinement mechanism: on the visual side, highly overlapping image patches are pruned based on an IoU threshold, while on the textual side, redundant descriptions are filtered using cosine similarity. This co-refinement strategy enhances both the discriminability and diversity of cross-modal alignments. BiFTA is seamlessly integrated into the CLIP framework and is compatible with both ViT and ResNet backbones. Extensive experiments across six benchmark datasets demonstrate significant improvements in zero-shot performance, validating the effectiveness of redundancy elimination for fine-grained alignment.
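The visual-side pruning described above can be illustrated as a greedy filter over candidate image-patch boxes: a patch is dropped if it overlaps an already-kept patch beyond an IoU threshold. This is only a minimal sketch; the function name `refine_views`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are illustrative assumptions, not the paper's implementation.

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def refine_views(boxes, iou_thresh=0.5):
    """Greedily keep a patch only if it overlaps no kept patch above the threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return kept

# Two heavily overlapping crops collapse to one; a distinct crop survives.
views = refine_views([(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)])
# → [(0, 0, 10, 10), (20, 20, 30, 30)]
```

Greedy filtering (as in non-maximum suppression) keeps the first of each overlapping group, so the order of candidate patches matters under this sketch.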

📝 Abstract
Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: \emph{View Refinement} and \emph{Description refinement}, termed as \textit{\textbf{Bi}-refinement for \textbf{F}ine-grained \textbf{T}ext-visual \textbf{A}lignment} (BiFTA). \emph{View refinement} removes redundant image patches with high \emph{Intersection over Union} (IoU) ratios, resulting in more distinctive visual samples. \emph{Description refinement} removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity to remove redundant information in visual-text alignment.
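The textual side of the abstract, description refinement, can be sketched analogously: given embedded text descriptions, keep a description only if its pairwise cosine similarity to every already-kept description stays below a threshold. The function name `refine_descriptions`, the plain-list embeddings, and the 0.9 threshold are illustrative assumptions, not the paper's settings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def refine_descriptions(embeddings, sim_thresh=0.9):
    """Greedily keep the index of a description only if it is not
    near-duplicate (cosine similarity > sim_thresh) of any kept one."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[k]) <= sim_thresh for k in kept):
            kept.append(i)
    return kept

# Two nearly parallel embeddings collapse to one; an orthogonal one survives.
idx = refine_descriptions([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
# → [0, 2]
```

In practice the embeddings would come from the model's text encoder (e.g. CLIP), and the surviving indices select which fine-grained descriptions are used for alignment.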
Problem

Research questions and friction points this paper is trying to address.

fine-grained alignment
vision-language models
redundant information
text-visual alignment
zero-shot performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bi-refinement
Fine-grained Alignment
Redundancy Removal
Vision-Language Models
Zero-shot Learning