Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications

📅 2024-05-01
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
TinyML person detection on ultra-low-power devices is bottlenecked by the scarcity of large-scale, high-quality training data. To address this, the authors introduce Wake Vision, the first TinyML-specific person-detection dataset, comprising over 6 million images in two variants: "Large" (optimized for scale) and "Quality" (optimized for label accuracy). Human verification of the validation and test sets reduces label error rates from 7.8% to 2.2%, a 5.6 percentage point improvement over prior standards. The paper also establishes a suite of five fine-grained benchmark sets covering realistic deployment conditions: varying illumination, subject distance, and demographic characteristics. By combining data-quality filtering, knowledge distillation–based pretraining on the Large variant, and a TinyML-adapted evaluation framework, training on Wake Vision improves detection accuracy by 1.93% over existing datasets. All data, code, and models are publicly released under the CC-BY 4.0 license.
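The knowledge distillation–based pretraining mentioned above follows the standard soft-target recipe: a small "student" model is trained to match the temperature-softened output distribution of a larger "teacher". The sketch below is a minimal NumPy illustration of that loss; the temperature value and toy binary logits are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T produces softer distributions."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student distributions,
    scaled by T^2 as in standard knowledge distillation (Hinton et al.)."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl.mean()

# Toy person / no-person logits for two images (illustrative only)
teacher = np.array([[4.0, -2.0], [-1.0, 3.0]])
student = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = distillation_loss(student, teacher)
```

The loss is zero when the student exactly matches the teacher and grows as their softened distributions diverge, which lets the abundant Large split supervise pretraining even where its labels are noisy.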

📝 Abstract
Tiny machine learning (TinyML) for low-power devices lacks robust datasets for development. We present Wake Vision, a large-scale dataset for person detection that contains over 6 million quality-filtered images. We provide two variants: Wake Vision (Large) and Wake Vision (Quality), leveraging the large variant for pretraining and knowledge distillation, while the higher-quality labels drive final model performance. The manually labeled validation and test sets reduce error rates from 7.8% to 2.2% compared to previous standards. In addition, we introduce five detailed benchmark sets to evaluate model performance in real-world scenarios, including varying lighting, camera distances, and demographic characteristics. Training with Wake Vision improves accuracy by 1.93% over existing datasets, demonstrating the importance of dataset quality for low-capacity models and dataset size for high-capacity models. The dataset, benchmarks, code, and models are available under the CC-BY 4.0 license, maintained by the Edge AI Foundation.
Problem

Research questions and friction points this paper is trying to address.

Lack of systematic methodologies for TinyML dataset creation
Need for automated pipeline to generate high-quality TinyML datasets
Absence of tailored datasets for specific TinyML deployment constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for binary classification datasets
Intelligent multi-source label fusion and correction
Comprehensive fine-grained benchmark suite evaluation
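The multi-source label fusion and correction step is described only at a high level here. As one plausible realization, it can be sketched as a weighted majority vote over per-source binary labels, with ties routed to manual review; the source names and reliability weights below are hypothetical, not taken from the paper.

```python
from collections import defaultdict

def fuse_labels(votes, weights=None):
    """Weighted majority vote over per-source binary labels.

    votes: dict mapping source name -> label (0 = no person, 1 = person)
    weights: optional dict of per-source reliability weights (hypothetical)
    Returns the fused label, or None on an exact tie (flag for human review).
    """
    weights = weights or {}
    score = defaultdict(float)
    for source, label in votes.items():
        score[label] += weights.get(source, 1.0)  # unweighted sources count as 1
    if score[0] == score[1]:
        return None  # ambiguous: route to manual verification
    return max(score, key=score.get)

# Hypothetical sources: original labels, a filter model, and a bounding-box heuristic
fused = fuse_labels(
    {"open_images": 1, "model_filter": 1, "bbox_heuristic": 0},
    weights={"open_images": 1.0, "model_filter": 0.8, "bbox_heuristic": 0.5},
)
```

Sending only the tied (ambiguous) cases to human annotators is what makes a human-verified validation set affordable at this scale.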