Single Domain Generalization for Few-Shot Counting via Universal Representation Matching

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Few-shot counting suffers from poor generalization under cross-domain settings, degrading especially sharply in single-source domain generalization, where the target domain is entirely unseen during training. To address this, we propose the first single-domain generalization framework for few-shot counting. Our method introduces two key innovations: (1) distilling general-purpose multimodal representations from large pretrained vision-language models to make prototype matching robust to domain shifts; and (2) an end-to-end differentiable correlation map generation mechanism that enables fine-grained cross-domain feature alignment. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on standard few-shot counting benchmarks. Moreover, on a newly constructed domain generalization evaluation set, it significantly outperforms existing methods while preserving strong in-domain accuracy, confirming its effectiveness in balancing cross-domain adaptability and within-domain fidelity.

📝 Abstract
Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This falls into the realm of single domain generalization, which remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts object prototypes from the exemplars and then matches them with image features to construct the correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single domain generalization few-shot counting model, Universal Representation Matching (URM). Our primary contribution is the discovery that incorporating universal vision-language representations, distilled from a large-scale pretrained vision-language model, into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance in both the in-domain and the newly introduced domain generalization settings.
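The standard pipeline the abstract critiques (extract prototypes from the exemplars, then match them against image features to build a correlation map) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine-similarity matching, the pooled-prototype representation, and all array shapes are assumptions for the sake of the example.

```python
import numpy as np

def correlation_map(image_feats, prototypes):
    """Match exemplar prototypes against image features.

    image_feats: (H, W, C) backbone feature map of the query image.
    prototypes:  (K, C) pooled exemplar prototypes.
    Returns an (H, W, K) correlation map, one channel per prototype,
    using cosine similarity (an assumed matching function).
    """
    H, W, C = image_feats.shape
    feats = image_feats.reshape(-1, C)
    # L2-normalize so the dot product becomes cosine similarity.
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    protos = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    corr = feats @ protos.T            # (H*W, K)
    return corr.reshape(H, W, -1)

# Toy usage with random features standing in for a real backbone.
rng = np.random.default_rng(0)
img_feats = rng.standard_normal((8, 8, 16))
protos = rng.standard_normal((3, 16))
maps = correlation_map(img_feats, protos)
print(maps.shape)  # (8, 8, 3)
```

In a full counting model, a decoder head would regress a density map from these correlation channels; the paper's argument is that the quality of this map hinges on how generalized the prototypes are.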
Problem

Research questions and friction points this paper is trying to address.

Few-shot counting struggles with domain shift in unseen scenarios
Existing methods lack generalized prototypes for robust performance
Universal vision-language representations improve domain generalization in counting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal vision-language representations enhance robustness
Distilled from large-scale pretrained vision-language model
Improves domain generalization without performance loss
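The distillation idea in the bullets above can be sketched as a simple feature-alignment objective: pull the counting model's features toward those of a frozen pretrained vision-language teacher. The L2 loss on normalized features below is one common choice and is an assumption here; the summary does not specify the paper's exact distillation objective.

```python
import numpy as np

def distill_loss(student_feats, teacher_feats):
    """L2 distillation loss between L2-normalized student features and
    frozen vision-language teacher features (an assumed objective).

    Both inputs: (N, C) feature vectors. Returns a scalar loss.
    """
    s = student_feats / (np.linalg.norm(student_feats, axis=-1, keepdims=True) + 1e-8)
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum((s - t) ** 2, axis=-1)))

# Perfectly aligned features incur (near-)zero loss.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 32))
zero_loss = distill_loss(x, x)
mismatch_loss = distill_loss(x, rng.standard_normal((4, 32)))
```

Because the teacher stays frozen, the loss only regularizes the student toward the teacher's domain-agnostic representation space, which is the mechanism the bullets credit for improved generalization without in-domain performance loss.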
Xianing Chen
Alibaba
Si Huo
Huawei Noah's Ark Lab
Borui Jiang
Huawei Noah's Ark Lab
Hailin Hu
Huawei Noah's Ark Lab
Xinghao Chen
Huawei Noah's Ark Lab