THOR: Thermal-guided Hand-Object Reasoning via Adaptive Vision Sampling

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Wearable RGB cameras suffer from excessive power consumption, redundant data generation, privacy leakage, and high computational overhead due to continuous video capture. To address these challenges, we propose THOR, a thermal-aware adaptive visual sampling framework. THOR is the first to leverage low-power thermal imaging to detect hand-activity transitions, dynamically modulating the RGB sampling rate and localizing hand-object interaction regions of interest (ROIs) for spatiotemporally adaptive, lightweight video acquisition. By tightly integrating thermal sensing, activity-transition detection, and ROI-guided sampling, THOR enables real-time egocentric hand-object action recognition entirely on-device. Experiments show that THOR captures all activity segments using only 3% of the original RGB frames and achieves a 95% F1-score on the Ego4D benchmark, matching full-frame processing accuracy, while substantially reducing bandwidth usage, energy consumption, and privacy risk.

📝 Abstract
Wearable cameras are increasingly used as an observational and interventional tool for human behaviors, providing detailed visual data of hand-related activities. This data can be leveraged to facilitate memory recall for behavior logging or timely interventions aimed at improving health. However, continuous processing of RGB images from these cameras consumes significant power, reducing battery lifetime, generates a large volume of unnecessary video data for post-processing, raises privacy concerns, and requires substantial computational resources for real-time analysis. We introduce THOR, a real-time adaptive spatio-temporal RGB frame sampling method that leverages thermal sensing to capture hand-object patches and classify them in real-time. We use low-resolution thermal camera data to identify moments when a person switches from one hand-related activity to another, and adjust the RGB frame sampling rate by increasing it during activity transitions and reducing it during periods of sustained activity. Additionally, we use the thermal cues from the hand to localize the region of interest (i.e., the hand-object interaction) in each RGB frame, allowing the system to crop and process only the necessary part of the image for activity recognition. We develop a wearable device to validate our method through an in-the-wild study with 14 participants and over 30 activities, and further evaluate it on Ego4D (923 participants across 9 countries, totaling 3,670 hours of video). Our results show that using only 3% of the original RGB video data, our method captures all the activity segments, and achieves a hand-related activity recognition F1-score (95%) comparable to using the entire RGB video (94%). Our work provides a more practical path for the longitudinal use of wearable cameras to monitor hand-related activities and health-risk behaviors in real time.
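The temporal half of this idea, raising the RGB frame rate when the thermal signal indicates an activity transition and lowering it during sustained activity, can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the change metric, threshold, and frame rates are assumed values.

```python
import numpy as np

# Hypothetical parameters; the paper's actual transition detector and
# sampling rates are not specified here.
TRANSITION_THRESH = 2.0    # mean absolute thermal change (in deg C) flagging a transition
HIGH_FPS, LOW_FPS = 30, 1  # dense vs. sparse RGB sampling rates

def detect_transition(prev_thermal, cur_thermal, thresh=TRANSITION_THRESH):
    """Flag an activity transition when the low-resolution thermal frame
    changes substantially between consecutive samples."""
    return float(np.mean(np.abs(cur_thermal - prev_thermal))) > thresh

def rgb_sampling_rate(prev_thermal, cur_thermal):
    """Raise the RGB frame rate during transitions; lower it during
    periods of sustained activity."""
    if detect_transition(prev_thermal, cur_thermal):
        return HIGH_FPS
    return LOW_FPS

# Toy 8x8 thermal frames: a stable scene, then a warm hand moving into a new region.
stable = np.full((8, 8), 30.0)
moved = stable.copy()
moved[2:6, 2:6] += 10.0

print(rgb_sampling_rate(stable, stable + 0.1))  # 1  (sustained activity)
print(rgb_sampling_rate(stable, moved))         # 30 (transition detected)
```

In a deployed system this loop would run on each new thermal sample, so the RGB camera stays at the sparse rate almost all the time, which is where the reported 3% frame usage comes from.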
Problem

Research questions and friction points this paper is trying to address.

Continuous RGB capture drains wearable-camera batteries through high power consumption
Always-on recording generates large volumes of unnecessary video and raises privacy concerns
Real-time hand-object activity recognition demands substantial computational resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thermal-guided adaptive spatio-temporal RGB frame sampling
Low-resolution thermal data detects activity transitions to modulate the RGB sampling rate
Thermal hand cues localize the hand-object interaction region for cropped processing
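The ROI-localization idea can be sketched as thresholding the low-resolution thermal frame at hand-skin temperatures, taking the bounding box of the warm pixels, and scaling it up to RGB coordinates so only that crop is processed. This is a minimal sketch under assumed values: the skin-temperature band, padding, and resolutions are illustrative, not the paper's.

```python
import numpy as np

# Assumed skin-temperature band (deg C) for hand pixels in the thermal frame.
SKIN_MIN, SKIN_MAX = 28.0, 38.0

def thermal_roi(thermal, rgb_shape, pad=0.1):
    """Bounding box of hand-temperature pixels in the low-res thermal
    frame, scaled to RGB pixel coordinates with a small margin.
    Returns (y0, y1, x0, x1), or None if no hand is visible."""
    mask = (thermal >= SKIN_MIN) & (thermal <= SKIN_MAX)
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None  # no hand in view: the RGB frame can be skipped entirely
    sy = rgb_shape[0] / thermal.shape[0]
    sx = rgb_shape[1] / thermal.shape[1]
    y0, y1 = ys.min() * sy, (ys.max() + 1) * sy
    x0, x1 = xs.min() * sx, (xs.max() + 1) * sx
    dy, dx = (y1 - y0) * pad, (x1 - x0) * pad  # pad the box slightly
    return (max(0, int(y0 - dy)), min(rgb_shape[0], int(y1 + dy)),
            max(0, int(x0 - dx)), min(rgb_shape[1], int(x1 + dx)))

# 8x8 thermal frame with a warm hand blob, paired with a 480x640 RGB frame.
thermal = np.full((8, 8), 22.0)  # cool background
thermal[3:6, 4:7] = 33.0         # hand at skin temperature
roi = thermal_roi(thermal, (480, 640))
print(roi)  # (162, 378, 296, 584)

rgb = np.zeros((480, 640, 3))
y0, y1, x0, x1 = roi
crop = rgb[y0:y1, x0:x1]  # only this patch is fed to the recognizer
```

Cropping before recognition is what makes the sampling spatially adaptive: the classifier sees only the hand-object patch rather than the full frame, cutting both compute and incidental privacy exposure.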