AI Summary
This work addresses 3D object detection and multi-object tracking in indoor multi-camera systems under occlusion, heterogeneous viewpoints, and low frame rates. Building on the Sparse4D framework, it fuses multi-view features in a unified world coordinate system and uses an instance memory mechanism to propagate sparse object queries across frames. The study proposes the Average Track Duration metric to evaluate identity persistence, revealing severe identity-association collapse at low frame rates. It further finds that attention modules are highly sensitive to quantization, whereas the backbone and neck components can be quantized selectively. Experiments demonstrate that moderate frame-rate reduction preserves performance, selective post-training INT8/FP8 quantization achieves the best speed-accuracy trade-off, and Transformer Engine mixed-precision fine-tuning substantially accelerates inference, though it can destabilize identity propagation.
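The selective-quantization finding can be illustrated with a minimal fake-quantization sketch: weights outside attention modules are rounded to an INT8 grid and dequantized, while attention weights stay in full precision. This is an illustrative sketch only, with hypothetical function and key names; the actual work uses post-training INT8/FP8 deployment tooling, not shown here.

```python
def fake_quant_int8(weights):
    """Symmetric per-tensor INT8 fake quantization: quantize to the
    [-127, 127] grid, then dequantize, so the rounding error mimics
    the error of deployed INT8 weights."""
    scale = max(max(abs(w) for w in weights) / 127.0, 1e-12)
    return [round(w / scale) * scale for w in weights]

def selectively_quantize(state_dict, keep_fp=("attn",)):
    """Quantize backbone/neck weights only; skip modules whose names
    match attention-related patterns, since the study finds attention
    modules are consistently sensitive to low precision."""
    out = {}
    for name, weights in state_dict.items():
        if any(key in name for key in keep_fp):
            out[name] = list(weights)  # precision-sensitive: keep full precision
        else:
            out[name] = fake_quant_int8(weights)
    return out

# Toy state dict with flat weight lists (hypothetical layer names).
sd = {
    "backbone.conv.weight": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "attn.qkv.weight": [0.5, -0.5],
}
quantized = selectively_quantize(sd)
```

Here `"attn.qkv.weight"` survives untouched, while `"backbone.conv.weight"` picks up a small INT8 rounding error.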
Abstract
Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.
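The abstract defines AvgTrackDur only informally, as identity persistence in seconds. A minimal sketch of one plausible reading, assuming each predicted identity's duration is its frame span from first to last appearance divided by the frame rate (function and variable names are hypothetical, not from the paper):

```python
def avg_track_duration(frames, fps):
    """Average Track Duration in seconds.

    frames: per-frame collections of predicted track IDs.
    fps: frame rate of the sequence.
    An ID fragmented into multiple new IDs yields several short
    tracks, pulling the average down -- which is how the metric
    exposes identity-association collapse at low frame rates.
    """
    first, last = {}, {}
    for t, ids in enumerate(frames):
        for tid in ids:
            first.setdefault(tid, t)
            last[tid] = t
    if not first:
        return 0.0
    durations = [(last[tid] - first[tid] + 1) / fps for tid in first]
    return sum(durations) / len(durations)

# Example at 2 FPS: ID 1 persists all 4 frames (2.0 s), while a
# second target is fragmented into IDs 2 and 3 (1.0 s each).
frames = [{1, 2}, {1, 2}, {1, 3}, {1, 3}]
avg = avg_track_duration(frames, fps=2.0)  # (2.0 + 1.0 + 1.0) / 3
```

Under this reading, perfect identity propagation on an always-visible target yields a duration equal to the clip length, while every identity switch splits one long track into shorter ones.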