🤖 AI Summary
TinyML faces a systemic bottleneck: deploying high-accuracy person detection on ultra-low-power devices is limited by the scarcity of large-scale, high-quality training data. To address this, we introduce Wake Vision, the first TinyML-specific dataset for person detection, comprising over 6 million images in two complementary versions: “Large” (optimized for scale) and “Quality” (optimized for label accuracy). We curate human-verified validation and test sets, reducing label error rates from 7.8% to 2.2%, a 5.6 percentage point improvement. We also establish a robustness benchmark covering five realistic scenarios, including varying illumination, subject distance, and demographic diversity. Combining data-quality filtering, knowledge distillation–based pretraining, and a TinyML-adapted evaluation framework, our approach improves detection accuracy by 1.93% on representative models. All data, code, and models are publicly released under the CC-BY 4.0 license.
📝 Abstract
Tiny machine learning (TinyML) for low-power devices lacks robust datasets for development. We present Wake Vision, a large-scale dataset for person detection containing over 6 million quality-filtered images. We provide two variants, Wake Vision (Large) and Wake Vision (Quality), using the large variant for pretraining and knowledge distillation, while the higher-quality labels drive final model performance. Our manually labeled validation and test sets reduce label error rates from 7.8% to 2.2% relative to previous standards. In addition, we introduce five fine-grained benchmark sets to evaluate model performance in real-world scenarios, including varying lighting conditions, camera distances, and demographic characteristics. Training with Wake Vision improves accuracy by 1.93% over existing datasets, demonstrating the importance of dataset quality for low-capacity models and of dataset size for high-capacity models. The dataset, benchmarks, code, and models are available under the CC-BY 4.0 license and are maintained by the Edge AI Foundation.