AI Summary
This work addresses 3D object detection and multi-object tracking in indoor multi-camera systems under occlusion, heterogeneous viewpoints, and low frame rates. Building on the Sparse4D framework, it fuses multi-view features in a unified world coordinate system and uses an instance memory mechanism to propagate sparse object queries across frames. The study proposes the Average Track Duration metric to evaluate identity persistence, revealing severe identity-association collapse at low frame rates. It further finds that attention modules are highly sensitive to quantization, whereas the backbone and neck components can be quantized selectively. Experiments demonstrate that moderate frame-rate reduction preserves performance, selective post-training INT8/FP8 quantization achieves the best speed-accuracy trade-off, and Transformer Engine mixed-precision fine-tuning substantially accelerates inference, though it can destabilize identity propagation.
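The selective-quantization finding can be illustrated with a minimal fake-quantization sketch: weights outside attention modules are rounded to an INT8 grid and dequantized, while attention weights stay in full precision. This is an illustrative sketch only, with hypothetical function and key names; the actual work uses post-training INT8/FP8 deployment tooling, not shown here.

```python
def fake_quant_int8(weights):
    """Symmetric per-tensor INT8 fake quantization: quantize to the
    [-127, 127] grid, then dequantize, so the rounding error mimics
    the error of deployed INT8 weights."""
    scale = max(max(abs(w) for w in weights) / 127.0, 1e-12)
    return [round(w / scale) * scale for w in weights]

def selectively_quantize(state_dict, keep_fp=("attn",)):
    """Quantize backbone/neck weights only; skip modules whose names
    match attention-related patterns, since the study finds attention
    modules are consistently sensitive to low precision."""
    out = {}
    for name, weights in state_dict.items():
        if any(key in name for key in keep_fp):
            out[name] = list(weights)  # precision-sensitive: keep full precision
        else:
            out[name] = fake_quant_int8(weights)
    return out

# Toy state dict with flat weight lists (hypothetical layer names).
sd = {
    "backbone.conv.weight": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "attn.qkv.weight": [0.5, -0.5],
}
quantized = selectively_quantize(sd)
```

Here `"attn.qkv.weight"` survives untouched, while `"backbone.conv.weight"` picks up a small INT8 rounding error.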
Abstract
Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.
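The abstract defines AvgTrackDur only informally, as identity persistence in seconds. A minimal sketch of one plausible reading, assuming each predicted identity's duration is its frame span from first to last appearance divided by the frame rate (function and variable names are hypothetical, not from the paper):

```python
def avg_track_duration(frames, fps):
    """Average Track Duration in seconds.

    frames: per-frame collections of predicted track IDs.
    fps: frame rate of the sequence.
    An ID fragmented into multiple new IDs yields several short
    tracks, pulling the average down -- which is how the metric
    exposes identity-association collapse at low frame rates.
    """
    first, last = {}, {}
    for t, ids in enumerate(frames):
        for tid in ids:
            first.setdefault(tid, t)
            last[tid] = t
    if not first:
        return 0.0
    durations = [(last[tid] - first[tid] + 1) / fps for tid in first]
    return sum(durations) / len(durations)

# Example at 2 FPS: ID 1 persists all 4 frames (2.0 s), while a
# second target is fragmented into IDs 2 and 3 (1.0 s each).
frames = [{1, 2}, {1, 2}, {1, 3}, {1, 3}]
avg = avg_track_duration(frames, fps=2.0)  # (2.0 + 1.0 + 1.0) / 3
```

Under this reading, perfect identity propagation on an always-visible target yields a duration equal to the clip length, while every identity switch splits one long track into shorter ones.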