🤖 AI Summary
This study addresses the difficulty of locating critical events and assessing their intensity in body-worn camera footage, which lacks structured annotations. The authors propose a two-dimensional labeling framework that jointly captures operational context and activity intensity. Video is segmented into temporally aligned 10-second windows and labeled under a privacy-conscious protocol, with low-evidence labels for dark, blurred, or occluded segments. Window-level representations combine aggregated CLIP frame features with dense optical flow statistics. On the test set, the best models reach 78.75% accuracy for context classification and 88.33% for intensity classification. The results are accompanied by integrity audits, and the resulting visual timelines are intended to support faster incident review and officer training.
📝 Abstract
Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage. However, this footage remains operationally opaque: analysts and trainers still have to spend considerable time watching full-length videos to pinpoint where key encounters begin and where activity shifts to something more physically intense. We present an approach that converts BWC video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled under a privacy-conscious protocol. Each window is labeled along two dimensions: (i) its operational context and (ii) its level of motion intensity, with low-evidence labels for windows where darkness, blur, or occlusion leaves insufficient evidence. We train models to classify windows along these two axes using frames sampled from each window, encoded with a CLIP model, and aggregated into a window-level representation. We also extract dense optical flow statistics for each window to capture motion intensity. On test windows, the best context model achieves 78.75% accuracy, and the activity model with the best accuracy achieves 88.33%. We also report integrity audits alongside the results and show how the visual timeline representations support faster incident review and make the officer training workflow more practical.
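As a rough illustration of the window-level features described in the abstract, the following Python sketch groups per-frame embeddings into 10-second windows and summarizes a dense flow field by its magnitude. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how frames are aggregated or which flow statistics are used, so mean pooling with L2 normalization and the mean, standard deviation, and 95th percentile of flow magnitude are illustrative choices. The function names and the 512-dimensional random embeddings are hypothetical placeholders standing in for CLIP image features and for the output of a dense optical flow method.

```python
import numpy as np

WINDOW_SECONDS = 10.0  # fixed window length stated in the abstract


def window_index(timestamps, window_seconds=WINDOW_SECONDS):
    """Map each frame timestamp (in seconds) to the index of its window."""
    ts = np.asarray(timestamps, dtype=float)
    return np.floor(ts / window_seconds).astype(int)


def aggregate_embeddings(embeddings, timestamps, window_seconds=WINDOW_SECONDS):
    """Mean-pool per-frame embeddings into one L2-normalized vector per window.

    Mean pooling is an assumption; the abstract only says frames are aggregated.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    idx = window_index(timestamps, window_seconds)
    pooled = {}
    for w in np.unique(idx):
        mean_vec = embeddings[idx == w].mean(axis=0)
        norm = np.linalg.norm(mean_vec)
        pooled[int(w)] = mean_vec / norm if norm > 0 else mean_vec
    return pooled


def flow_statistics(flow):
    """Summarize a dense flow field of shape (H, W, 2) by its per-pixel magnitude.

    The chosen statistics are assumptions, not taken from the abstract.
    """
    magnitude = np.linalg.norm(np.asarray(flow, dtype=float), axis=-1)
    return {
        "mean": float(magnitude.mean()),
        "std": float(magnitude.std()),
        "p95": float(np.percentile(magnitude, 95)),
    }


# Usage with placeholder data: 10 frames sampled every 2.5 s span three windows.
rng = np.random.default_rng(0)
timestamps = np.arange(0, 25, 2.5)
embeddings = rng.normal(size=(len(timestamps), 512))  # stand-in for CLIP features
windows = aggregate_embeddings(embeddings, timestamps)

# A uniform flow of (3, 4) pixels per frame has magnitude 5 everywhere.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 3.0
flow[..., 1] = 4.0
stats = flow_statistics(flow)
```

In a pipeline of this kind, the pooled vector and the flow statistics for each window could be concatenated or fed to separate classifiers for the context and intensity axes; which of these the authors use is not stated in the abstract.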