🤖 AI Summary
Vision-language models (VLMs) suffer from weak fine-grained perception, susceptibility to visual biases, and insensitivity to subtle visual distinctions—largely due to overreliance on coarse-grained recognition in existing training data.
Method: We propose TWIN, the first large-scale image-pair identity discrimination dataset (561K samples), enabling a new fine-grained visual training paradigm; we design FGVQA, a cross-domain fine-grained visual question-answering benchmark (12K queries); and we adopt image-pair binary-classification supervised fine-tuning, which supports multi-source data reconstruction and cross-domain transfer evaluation across mainstream open-source VLM architectures.
Contribution/Results: We empirically validate the critical roles of data scale and annotation density in perception accuracy. On FGVQA, our approach achieves up to +19.3% absolute improvement over strong baselines, generalizes robustly to unseen domains—including art, flora/fauna, and landmarks—and preserves general-purpose VQA performance.
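The supervised fine-tuning objective above is a binary decision over an image pair: given two visually similar images, answer whether they depict the same object instance. The paper's actual data schema is not shown here; the following is a minimal sketch of how one such chat-style SFT sample might be assembled, with all field names and the prompt wording being hypothetical:

```python
def make_twin_sample(image_a: str, image_b: str, same_object: bool) -> dict:
    """Build one hypothetical image-pair binary-classification SFT sample.

    image_a / image_b: paths (or URLs) to the two visually similar images.
    same_object: ground-truth label, True if both images show the same
    object instance. Field names and prompt text are illustrative, not
    the dataset's actual schema.
    """
    return {
        "images": [image_a, image_b],
        "prompt": (
            "You are shown two images of visually similar objects. "
            "Do both images depict the exact same object instance? "
            "Answer 'yes' or 'no'."
        ),
        "answer": "yes" if same_object else "no",
    }


# Example: a positive pair (same mug photographed from two viewpoints)
sample = make_twin_sample("mug_front.jpg", "mug_side.jpg", same_object=True)
```

Framing the task as a yes/no question keeps the supervision signal simple while forcing the model to compare nuanced visual cues across the pair rather than relying on category-level recognition.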
📝 Abstract
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. Existing VLMs struggle on FGVQA, but after fine-tuning on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, TWIN scales favorably with the number of object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing the perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/