🤖 AI Summary
This work proposes GraySense, a novel framework that enables geospatial target tracking using only encrypted wireless video traffic when raw sensor data is unavailable. By leveraging metadata such as packet sizes from encrypted video streams, GraySense employs a two-stage architecture: a Packet Grouping module estimates frame boundaries and frame sizes, while a Tracker module integrates indirect traffic signals, optionally fused with visual inputs, via a recurrent state-space Transformer to achieve precise localization. Evaluated in CARLA simulations under realistic network conditions, the method attains an average tracking error of 2.33 meters using encrypted traffic alone, an error smaller than typical vehicle dimensions (4.61 m × 1.93 m), substantially expanding the applicability of implicit network signals for perception tasks.
📝 Abstract
Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video traffic, such as packet sizes, from cameras whose streams are inaccessible. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, which fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object's position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves a tracking error of 2.33 meters (Euclidean distance) without raw signal access, which is within the dimensions of the tracked objects (4.61 m × 1.93 m). To our knowledge, this capability has not been previously demonstrated, expanding the use of latent signals for sensing.
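The first stage described above, recovering frame boundaries and frame sizes from encrypted packet traces, can be illustrated with a simple heuristic. This is a hypothetical sketch, not the paper's learned Packet Grouping module: it assumes that packets of one video frame arrive in a burst, so a large inter-arrival gap marks a frame boundary, and the estimated frame size is the sum of the packet sizes in the burst. The function name, the gap threshold, and the sample trace are illustrative assumptions.

```python
def group_packets(packets, gap_threshold=0.005):
    """Estimate video frame sizes from an encrypted packet trace.

    packets: list of (timestamp_seconds, size_bytes), sorted by timestamp.
    gap_threshold: inter-arrival gap (s) treated as a frame boundary
                   (illustrative value, not from the paper).
    Returns a list of estimated frame sizes in bytes.
    """
    frames = []
    current_size = 0
    last_t = None
    for t, size in packets:
        # A gap larger than the threshold means the previous burst
        # (i.e., the previous video frame) has ended.
        if last_t is not None and t - last_t > gap_threshold:
            frames.append(current_size)
            current_size = 0
        current_size += size
        last_t = t
    if current_size:
        frames.append(current_size)
    return frames

# Hypothetical 30-fps trace: one burst near t=0, another near t=1/30.
pkts = [(0.000, 1400), (0.001, 1400), (0.002, 600),
        (0.0333, 1400), (0.0343, 900)]
print(group_packets(pkts))  # → [3400, 2300]
```

In GraySense itself this stage is learned rather than threshold-based, and its per-frame size estimates feed the downstream Tracker, which correlates frame-size dynamics (e.g., larger encoded frames when the scene changes more) with object motion.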