AI Summary
Instance-level recognition (ILR) is severely constrained by data scarcity due to the high cost of fine-grained annotation. To address this, we propose the first end-to-end synthetic data generation framework specifically designed for ILR, requiring only the target domain name as input: no real images, manual collection, or human labeling are needed. Our method leverages generative models to synthesize diverse object instances across multiple domains, conditions, and backgrounds, and integrates virtual data augmentation with domain-adaptive fine-tuning strategies for visual model training. Evaluated on seven cross-domain ILR benchmarks, models trained exclusively on our synthetic data achieve retrieval performance on par with those trained on real data. This demonstrates the efficacy of synthetic data for representation learning in ILR and establishes a novel zero-real-sample training paradigm.
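The pipeline described above starts from nothing but a domain name and expands it into instance-labeled prompts that vary conditions and backgrounds. The following is a minimal illustrative sketch of such a prompt-construction step, not the paper's actual implementation; the function name, condition list, and prompt template are all hypothetical:

```python
import itertools
import random


def build_prompts(domain: str, n_instances: int = 3, seed: int = 0) -> list[str]:
    """Expand a single domain name into diverse text-to-image prompts.

    Hypothetical sketch: the instance IDs, conditions, and backgrounds
    below are illustrative placeholders, not the paper's templates.
    """
    rng = random.Random(seed)
    conditions = ["in daylight", "at night", "close-up", "partially occluded"]
    backgrounds = ["on a plain background", "in an indoor scene", "outdoors"]
    prompts = []
    for instance_id in range(n_instances):
        # Each synthetic instance keeps a stable identity tag so that
        # instance-level (not category-level) labels can be derived
        # for the generated images.
        for cond, bg in itertools.product(conditions, backgrounds):
            prompts.append(
                f"a photo of {domain} instance {instance_id}, {cond}, {bg}"
            )
    rng.shuffle(prompts)
    return prompts


prompts = build_prompts("vintage wristwatch", n_instances=2)
print(len(prompts))  # 2 instances x 4 conditions x 3 backgrounds = 24
```

Each prompt would then be fed to a text-to-image generator, and the resulting images, grouped by their instance tag, would supply the supervision for fine-tuning the vision model.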
Abstract
Instance-level recognition (ILR) focuses on identifying individual objects rather than broad categories, offering the highest granularity in image classification. However, this fine-grained nature makes creating large-scale annotated datasets challenging, limiting ILR's real-world applicability across domains. To overcome this, we introduce a novel approach that synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds, forming a large-scale training set. Unlike prior work on automatic data synthesis, our method is the first to address ILR-specific challenges without relying on any real images. Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains. Our approach thus offers an efficient and effective alternative to extensive data collection and curation, and introduces a new ILR paradigm in which the only input is the names of the target domains, unlocking a wide range of real-world applications.