🤖 AI Summary
Existing vision-language models (VLMs) struggle with fine-grained recognition due to their inability to capture subtle discriminative features among subordinate categories. This limitation stems primarily from conventional alignment-based prediction frameworks—lacking inter-class interaction—and restricted unimodal feature modeling capacity. To address this, we propose a cross-relational modeling approach: (1) a multi-part visual encoder and learnable multi-perspective text prompts jointly construct a cross-modal relational fusion attention mechanism, enabling interactive, class-aware forward inference; and (2) a VLM-adapted joint optimization framework integrating both modalities holistically. Our method departs from the dominant single-alignment paradigm, achieving significant improvements over state-of-the-art methods on fine-grained benchmarks including CUB-200-2011 and Stanford Cars. Extensive experiments validate that explicit cross-modal relational modeling substantially enhances discriminability of subtle visual distinctions while maintaining strong generalization across diverse fine-grained domains.
📝 Abstract
Vision-Language Models (VLMs) have demonstrated impressive performance on various visual tasks, yet they still require adaptation on downstream tasks to achieve optimal performance. Recently, a variety of adaptation techniques have been proposed, but we observe that they often underperform in fine-grained visual recognition, which requires models to capture subtle yet discriminative features that distinguish similar sub-categories. Current adaptation methods typically rely on an alignment-based prediction framework, i.e., the visual feature is compared with each class prompt to compute a similarity score that serves as the final prediction, which precludes class interaction during the forward pass. Moreover, learning a single uni-modal feature further restricts the model's expressive capacity. We therefore propose a novel mechanism, XR-VLM, that discovers subtle differences by modeling cross-relationships, and which particularly excels in scenarios involving multiple features. Our method introduces a unified multi-part visual feature extraction module designed to integrate seamlessly with the diverse backbones used in VLMs. Additionally, we develop a multi-part prompt learning module to capture multi-perspective descriptions of sub-categories. To further enhance discriminative capability, we propose a cross-relationship modeling pattern that combines visual features with all class prompt features, enabling a deeper exploration of the relationships between the two modalities. Extensive experiments on various fine-grained datasets demonstrate that our method achieves significant improvements over current state-of-the-art approaches. Code will be released.
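To make the contrast in the abstract concrete, here is a minimal sketch of the two prediction patterns: the standard alignment-based scheme scores each class independently by its similarity to the visual feature, whereas a cross-relational variant first builds a relation matrix between every visual part feature and *every* class prompt, so each class logit depends on all classes' prompts. All function names, the flattening scheme, and the per-class linear heads are illustrative assumptions, not XR-VLM's actual architecture (the paper's fusion uses an attention mechanism).

```python
import math


def cosine(u, v):
    # Plain cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def alignment_predict(visual_feat, class_prompts):
    """Conventional alignment-based prediction: one independent
    similarity per class; no class interacts with any other
    during the forward pass."""
    return [cosine(visual_feat, p) for p in class_prompts]


def cross_relation_predict(part_feats, class_prompts, class_heads):
    """Cross-relational sketch (illustrative): compute the relation
    of every visual part to every class prompt, flatten the matrix,
    and let a learned per-class head score the whole thing, so each
    class's logit sees all other classes' prompts."""
    relation = [cosine(f, p) for f in part_feats for p in class_prompts]
    return [sum(w * r for w, r in zip(head, relation))
            for head in class_heads]
```

In the alignment scheme the score for class *k* touches only prompt *k*; in the cross-relational sketch the flattened relation matrix gives every class head access to inter-class structure, which is the kind of class interaction the abstract argues is missing from alignment-only adaptation.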