🤖 AI Summary
Problem: In sensitive environments (e.g., homes and offices), raw IoT data streams can expose identifiable information, such as location and behavioral patterns, which poses privacy risks. Conventional data minimization relies on a binary "relevant and necessary" rule that is ill-suited to weak signals and dynamic streaming settings. Method: We propose the first sensor-stream-oriented data minimization framework, integrating the information bottleneck principle with differential privacy concepts to design an online feature selection mechanism, lightweight streaming model pruning, and semantic-aware redundancy assessment. Contribution/Results: We formally define computable data minimization for streaming IoT data and enable fine-grained, adaptive removal of non-essential signals. Experiments across multi-source IoT streams demonstrate up to a 16.7% reduction in user identifiability with <1% degradation in classification and prediction accuracy, in line with the data minimization principle in regulations such as GDPR.
📝 Abstract
Machine learning can analyze vast amounts of data generated by IoT devices to identify patterns, make predictions, and enable real-time decision-making. By processing sensor data, machine learning models can optimize processes, improve efficiency, and enhance personalized user experiences in smart systems. However, IoT systems are often deployed in sensitive environments such as households and offices, where they may inadvertently expose identifiable information, including location, habits, and personal identifiers. This raises significant privacy concerns and calls for data minimization, a foundational principle in emerging data regulations that requires service providers to collect only data that is directly relevant and necessary for a specified purpose. Despite its importance, data minimization lacks a precise technical definition in the context of sensor data, where collections of weak signals make it challenging to apply a binary "relevant and necessary" rule. This paper provides a technical interpretation of data minimization in the context of sensor streams, explores practical methods for implementation, and addresses the challenges involved. We demonstrate that our framework can reduce user identifiability by up to 16.7% while keeping accuracy loss below 1%, offering a viable path toward privacy-preserving IoT data processing.