🤖 AI Summary
This work addresses the high energy cost of continuous high-fidelity RGB video capture on resource-constrained edge and wearable devices, which hinders always-on video perception. To overcome this challenge, the authors propose a "grayscale-always, color-on-demand" paradigm: a low-power grayscale stream is captured continuously to preserve temporal structure, while a training-free online triggering mechanism, ColorTrigger, activates sparse RGB acquisition only when necessary. ColorTrigger detects color redundancy through causal (past-frames-only) analysis of local similarity in the grayscale stream, and combines a credit-budget controller with dynamic token routing to regulate RGB sampling. Experiments show the method retains 91.6% of full-color baseline performance while using only 8.1% of RGB frames, revealing substantial color redundancy in natural videos and offering a practical, efficient path to always-on video perception at the edge.
📝 Abstract
Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we find that color is not always necessary: sparse RGB frames suffice for comparable performance when temporal structure is preserved via a continuous grayscale stream. Building on this insight, we propose ColorTrigger, an online, training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% of RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.
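To make the "grayscale-always, color-on-demand" control loop concrete, here is a minimal sketch of a trigger in that spirit. This is **not** the paper's implementation: the class name, the cosine-similarity affinity measure (standing in for the paper's quadratic-programming-based affinity analysis), and all thresholds and credit parameters are illustrative assumptions. It shows only the causal structure: affinity is computed over a sliding window of past grayscale frames, and an RGB capture fires when affinity drops and the credit budget allows it.

```python
import numpy as np

class ColorTriggerSketch:
    """Illustrative sketch (not the paper's method): trigger sparse RGB
    capture when windowed grayscale affinity drops, under a credit budget."""

    def __init__(self, window=4, affinity_thresh=0.9,
                 credit_rate=0.1, credit_cost=1.0, max_credits=3.0):
        # All parameter values below are hypothetical defaults.
        self.window = window                    # frames kept for affinity stats
        self.affinity_thresh = affinity_thresh  # below this => likely scene change
        self.credit_rate = credit_rate          # credits earned per grayscale frame
        self.credit_cost = credit_cost          # credits spent per RGB capture
        self.max_credits = max_credits
        self.credits = max_credits              # start with a full budget
        self.history = []                       # recent grayscale frames (past only)

    @staticmethod
    def _affinity(a, b):
        # Cosine similarity between flattened frames; 1.0 = identical up to scale.
        # Stand-in for the paper's QP-based affinity analysis.
        a, b = a.ravel().astype(float), b.ravel().astype(float)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 1.0

    def step(self, gray_frame):
        """Return True when an RGB frame should be captured for this step."""
        self.credits = min(self.credits + self.credit_rate, self.max_credits)
        if not self.history:
            self.history.append(gray_frame)
            return True  # always capture color at stream start
        # Mean affinity of the new frame to the recent window (causal: past only).
        aff = float(np.mean([self._affinity(gray_frame, h) for h in self.history]))
        self.history.append(gray_frame)
        self.history = self.history[-self.window:]
        if aff < self.affinity_thresh and self.credits >= self.credit_cost:
            self.credits -= self.credit_cost  # pay for the RGB capture
            return True
        return False
```

On a static scene the trigger fires only at the start; a sudden content change drops the windowed affinity below threshold and, if credits remain, releases one RGB capture. The credit budget bounds the long-run RGB rate regardless of how often affinity dips, which is the role the paper assigns to its credit-budgeted controller.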