🤖 AI Summary
This work addresses the distribution shift caused by diverse query styles—such as sketches, artistic renderings, and low-resolution images—in image retrieval and classification tasks. To tackle this challenge, the authors propose Hystar, a lightweight framework that, for the first time, integrates hypernetworks with dynamic singular value decomposition (SVD) modulation to generate input-style-adaptive perturbations for attention layers, while introducing static SVD offsets in MLP layers to enhance cross-style stability. Additionally, they design a StyleNCE loss based on optimal transport weighting to improve semantic discrimination of hard negative samples. Evaluated on multi-style image retrieval and cross-style classification benchmarks, Hystar achieves state-of-the-art performance while maintaining high parameter efficiency and robustness across diverse visual styles.
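The singular-value modulation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single linear hypernetwork `H`, the style embedding, the dimensions, and all function names are assumptions chosen for illustration. The sketch only shows the general idea of rebuilding a frozen weight as $U \,\mathrm{diag}(S + \Delta S)\, V^\top$, with $\Delta S$ predicted per input for attention layers and fixed for MLP layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_style = 8, 4  # illustrative sizes, not taken from the paper

# Frozen "pretrained" attention projection (random stand-in) and its SVD.
W = rng.standard_normal((d_model, d_model))
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Hypothetical hypernetwork: one linear map from a style embedding to delta_S.
# Zero-initialised so the adapted weight starts equal to the pretrained one.
H = np.zeros((d_model, d_style))

def delta_s(style_emb, H):
    """Per-input singular-value perturbation predicted from a style embedding."""
    return H @ style_emb

def adapted_weight(style_emb, H):
    """Dynamic (attention-style) variant: U diag(S + delta_S(x)) V^T."""
    return (U * (S + delta_s(style_emb, H))) @ Vt

def static_offset_weight(offset):
    """Static (MLP-style) variant: one offset shared by every input."""
    return (U * (S + offset)) @ Vt

style = rng.standard_normal(d_style)   # stand-in for a query's style embedding
W_zero = adapted_weight(style, H)      # equals W while H is all zeros

H_trained = 0.01 * rng.standard_normal((d_model, d_style))
W_adapted = adapted_weight(style, H_trained)
W_static = static_offset_weight(np.full(d_model, 0.05))
```

Because only the singular values change, each adapted layer needs one vector of `d_model` numbers rather than a full weight update, which is one plausible reading of the parameter efficiency claimed here.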
📝 Abstract
Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision-language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query's style. Hystar employs a hypernetwork to generate singular-value perturbations ($\Delta S$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusion across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
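The abstract does not give the StyleNCE formula, so the following is only a rough sketch of what an optimal-transport-weighted contrastive loss could look like, not the paper's definition. It assumes in-batch negatives, a cosine-distance cost, uniform marginals, and a Sinkhorn solver; the function names, `tau`, and `eps` are all hypothetical. Negatives that are closer to the query receive more transport mass and therefore a larger weight in the InfoNCE denominator.

```python
import numpy as np

def sinkhorn_plan(cost, mask, eps=0.1, n_iters=100):
    """Entropic OT plan with uniform marginals; entries where mask is False get zero mass."""
    n, m = cost.shape
    K = np.exp(-cost / eps) * mask
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def negative_weights(sim, eps=0.1):
    """Turn the plan into per-row negative weights that average 1 over the n-1 negatives."""
    n = sim.shape[0]
    off_diag = ~np.eye(n, dtype=bool)        # positives sit on the diagonal
    plan = sinkhorn_plan(1.0 - sim, off_diag, eps)
    return plan / plan.sum(axis=1, keepdims=True) * (n - 1)

def ot_weighted_info_nce(q, k, tau=0.07, eps=0.1):
    """InfoNCE in which each negative term is scaled by its OT-derived weight."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    sim = q @ k.T                            # cosine similarities in [-1, 1]
    w = negative_weights(sim, eps)           # treated as constants (assumption)
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    pos = np.exp(np.diag(logits))
    neg = (w * np.exp(logits)).sum(axis=1)   # w is zero on the diagonal
    return float(np.mean(-np.log(pos / (pos + neg))))

rng = np.random.default_rng(0)
queries = rng.standard_normal((8, 16))       # e.g. sketch embeddings (stand-in data)
loss_aligned = ot_weighted_info_nce(queries, queries + 0.01 * rng.standard_normal((8, 16)))
loss_random = ot_weighted_info_nce(queries, rng.standard_normal((8, 16)))
```

With uniform weights the expression reduces to standard InfoNCE, so the transport plan acts purely as a reweighting of negatives in this sketch.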