🤖 AI Summary
Existing object counting methods struggle in mixed-object scenes, largely because real-world annotations are costly and noisy and existing synthetic data lacks diversity and realism. To address this, the work proposes the first high-quality synthetic data generation framework tailored for open-vocabulary mixed-object counting. By combining automated image synthesis, fine-grained textual descriptions, and pixel-precise counting annotations, the framework constructs MixCount, a large-scale dataset and benchmark free of labeling ambiguity. Training state-of-the-art counting models on the synthesized data reduces mean absolute error (MAE) by 20.14% on the FSC-147 benchmark and by 18.3% on PairTally, indicating better generalization to real-world scenes.
📝 Abstract
Object counting is a foundational vision task with over a decade of dedicated research, yet state-of-the-art models still fail systematically in the mixed-object setting that dominates real-world applications such as industrial inspection and product sorting. We show that this gap is strongly driven by limitations in existing training and evaluation data: real counting datasets are prohibitively expensive to annotate and suffer from labeling noise, while existing synthetic alternatives lack diversity and realism. We address this with MixCount, a dataset and benchmark for mixed-object counting designed to target the failure modes of current counting models. To overcome the high cost of constructing and labeling such data, we develop an automatic generation pipeline that synthesizes images, fine-grained textual descriptions, and pixel-perfect counting annotations at scale, eliminating the labeling ambiguity that plagues prior datasets. Evaluating state-of-the-art counting models on MixCount exposes severe degradation in the mixed-object setting. More importantly, training these models on our synthesized data yields substantial gains on real-world benchmarks, reducing MAE by 20.14% on FSC-147 and by 18.3% on PairTally. These results establish MixCount as both a benchmark and a training dataset for fine-grained counting, and demonstrate that our pipeline, which produces effectively unlimited labeled data, helps address a long-standing bottleneck in counting models.
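The gains above are reported as percentage reductions in MAE. As a minimal sketch, assuming these figures are relative reductions against each model's baseline MAE (the abstract does not state this explicitly), the arithmetic could look like the following. The function names and all counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true per-image counts."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_reduction_percent(baseline_mae: float, new_mae: float) -> float:
    """Percentage by which new_mae is lower than baseline_mae (assumed definition)."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return 100.0 * (baseline_mae - new_mae) / baseline_mae


# Made-up per-image object counts, for illustration only.
true_counts = [12, 7, 30, 5]
baseline_preds = [15, 7, 24, 6]    # absolute errors: 3, 0, 6, 1
finetuned_preds = [13, 7, 27, 5]   # absolute errors: 1, 0, 3, 0

baseline_mae = mean_absolute_error(baseline_preds, true_counts)    # 2.5
finetuned_mae = mean_absolute_error(finetuned_preds, true_counts)  # 1.0
reduction = relative_reduction_percent(baseline_mae, finetuned_mae)  # 60.0
```

Under this reading, a 20.14% reduction means the MAE after training on the synthesized data is about 0.80 times the baseline MAE, whatever the baseline's absolute value.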