🤖 AI Summary
This study addresses two obstacles to human activity recognition in home environments: generalizing across users and settings, and preserving privacy. Existing approaches rely on cameras to label predefined activities, and those labels often do not match what privacy-preserving sensors can actually distinguish. The work proposes OrganicHAR, an activity discovery framework centered on non-visual sensors (radar, LiDAR, and thermal arrays) that identifies naturally occurring signal patterns. It invokes vision-language models (VLMs) only at the key moments these sensors flag, uses them to interpret the scene, and discovers discrete activity categories at a granularity the sensors can reliably detect, moving away from the conventional camera-centric annotation paradigm. In an evaluation with 12 participants, ambient sensors alone achieve 79% accuracy on 4–5 coarse-grained activities, and adding a wearable IMU plus depth and pose sensing yields 73% accuracy on 8–9 fine-grained activities. Accuracy averages 77% across sensor configurations, and VLM queries drop by 90%.
📝 Abstract
Deploying human activity recognition (HAR) at home is still rare because sensor signals vary wildly across houses, people, and time, essentially requiring in-situ data collection and training. Prior approaches use cameras to generate training labels for privacy-preserving sensors (LiDAR, radar, thermal), but this forces the sensors to detect predefined activities that cameras can see yet the sensors themselves cannot reliably distinguish. In this work, we introduce OrganicHAR, an activity discovery framework that inverts this relationship by placing sensor capabilities at the center of activity discovery. Our approach identifies naturally occurring signal patterns using privacy-preserving sensors, leverages Vision Language Models (VLMs) only during these key moments for scene understanding, and discovers discrete activity labels at granularities that these sensors can reliably detect. Our evaluation with 12 participants demonstrates OrganicHAR's effectiveness: it achieves 79% accuracy for coarse (4-5) activities using only basic ambient sensors (radar, LiDAR, thermal arrays), and 73% accuracy for fine-grained (8-9) activities when a wearable IMU and depth and pose sensors are added. OrganicHAR maintains 77% accuracy on average across configurations while discovering 4-8 categories per user (15 across all users) tailored to each environment and its sensor capabilities. By triggering video processing only at key moments identified by local sensors, we reduce VLM queries by 90%, enabling practical and privacy-preserving activity recognition in natural settings.
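The idea of querying the VLM only at sensor-identified key moments can be illustrated with a minimal sketch. This is not OrganicHAR's actual detection method, which the abstract does not specify. It assumes a single scalar feature per frame from a local sensor and a simple change rule, and the names `select_key_moments`, `query_reduction`, `threshold`, and `min_gap` are hypothetical.

```python
from typing import List, Sequence


def select_key_moments(
    signal: Sequence[float],
    threshold: float = 1.0,
    min_gap: int = 5,
) -> List[int]:
    """Return frame indices at which a VLM query would be triggered.

    Assumed rule (illustrative only): a frame is a key moment when its value
    differs from the value at the last queried frame by more than `threshold`
    and at least `min_gap` frames have passed since that query.
    """
    if not signal:
        return []
    key_frames = [0]            # always describe the first frame
    reference = signal[0]       # sensor value at the last queried frame
    for i in range(1, len(signal)):
        changed = abs(signal[i] - reference) > threshold
        far_enough = i - key_frames[-1] >= min_gap
        if changed and far_enough:
            key_frames.append(i)
            reference = signal[i]
    return key_frames


def query_reduction(num_frames: int, num_queries: int) -> float:
    """Fraction of frames that never reach the VLM."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return 1.0 - num_queries / num_frames


if __name__ == "__main__":
    # Synthetic stream: three steady segments of 20 frames each.
    stream = [0.0] * 20 + [3.0] * 20 + [0.5] * 20
    keys = select_key_moments(stream)
    print(keys)  # [0, 20, 40]
    print(f"{query_reduction(len(stream), len(keys)):.0%} fewer VLM queries")
```

On this synthetic stream only the three frames where the signal level shifts are sent for scene description. The reduction achieved in practice depends on the sensors, the features, and how often activity actually changes.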