🤖 AI Summary
Vision-language models (VLMs) suffer from weak fine-grained perception, susceptibility to visual biases, and insensitivity to subtle visual distinctions—largely due to overreliance on coarse-grained recognition in existing training data.
Method: We propose TWIN, the first large-scale image-pair identity discrimination dataset (561K samples), enabling a new fine-grained visual training paradigm; we design FGVQA, a cross-domain fine-grained visual question-answering benchmark (12K queries); and we adopt image-pair binary-classification supervised fine-tuning, which supports multi-source data reconstruction and cross-domain transfer evaluation across mainstream open-source VLM architectures.
Contribution/Results: We empirically validate the critical roles of data scale and annotation density in perception accuracy. On FGVQA, our approach achieves up to +19.3% absolute improvement over strong baselines, generalizes robustly to unseen domains—including art, flora/fauna, and landmarks—and preserves general-purpose VQA performance.
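The supervised fine-tuning objective above is a binary decision over an image pair: given two visually similar images, answer whether they depict the same object instance. The paper's actual data schema is not shown here; the following is a minimal sketch of how one such chat-style SFT sample might be assembled, with all field names and the prompt wording being hypothetical:

```python
def make_twin_sample(image_a: str, image_b: str, same_object: bool) -> dict:
    """Build one hypothetical image-pair binary-classification SFT sample.

    image_a / image_b: paths (or URLs) to the two visually similar images.
    same_object: ground-truth label, True if both images show the same
    object instance. Field names and prompt text are illustrative, not
    the dataset's actual schema.
    """
    return {
        "images": [image_a, image_b],
        "prompt": (
            "You are shown two images of visually similar objects. "
            "Do both images depict the exact same object instance? "
            "Answer 'yes' or 'no'."
        ),
        "answer": "yes" if same_object else "no",
    }


# Example: a positive pair (same mug photographed from two viewpoints)
sample = make_twin_sample("mug_front.jpg", "mug_side.jpg", same_object=True)
```

Framing the task as a yes/no question keeps the supervision signal simple while forcing the model to compare nuanced visual cues across the pair rather than relying on category-level recognition.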
📝 Abstract
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. Existing VLMs struggle on FGVQA, but after fine-tuning on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, TWIN scales favorably with the number of object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing the perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/