Model Optimization for Multi-Camera 3D Detection and Tracking

📅 2026-01-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenges of 3D object detection and multi-object tracking in indoor multi-camera systems under occlusion, heterogeneous viewpoints, and low frame rates. Building upon the Sparse4D framework, it fuses multi-view features in a unified world coordinate system and introduces an instance memory mechanism to propagate sparse object queries across frames. The study proposes the Average Track Duration metric to evaluate identity persistence, revealing severe identity association collapse under low frame rates. It further identifies that attention modules are highly sensitive to quantization, whereas backbone and neck components can be selectively quantized. Experiments demonstrate that moderate frame rate reduction preserves performance, selective post-training INT8/FP8 quantization achieves an optimal speed–accuracy trade-off, and fine-tuning with Transformer Engine mixed-precision significantly accelerates inference.
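The paper's evaluation code is not reproduced on this page, but the Average Track Duration metric described above (mean identity persistence in seconds) can be sketched as follows. This is a minimal illustration; the function name `avg_track_duration` and the `(frame_index, track_id)` input representation are assumptions, not the authors' exact implementation.

```python
def avg_track_duration(detections, fps):
    """Mean identity persistence in seconds, averaged over track IDs.

    detections: iterable of (frame_index, track_id) pairs.
    A track's duration spans its first to last observed frame,
    converted to seconds via the sequence frame rate.
    (Illustrative sketch; not the paper's reference implementation.)
    """
    first, last = {}, {}
    for frame, tid in detections:
        first[tid] = min(first.get(tid, frame), frame)
        last[tid] = max(last.get(tid, frame), frame)
    if not first:
        return 0.0
    durations = [(last[t] - first[t] + 1) / fps for t in first]
    return sum(durations) / len(durations)
```

Under a metric like this, dropping the input frame rate directly shortens the longest achievable durations, which is why identity collapse below 2 FPS shows up clearly even when per-frame detections remain stable.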

๐Ÿ“ Abstract
Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.
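The selective-quantization finding above (quantize the backbone and neck, keep attention modules in higher precision) can be sketched with PyTorch post-training quantization. This is a hedged illustration under assumptions: the submodule names `backbone`/`neck`, the helper name `quantize_selected`, and the use of dynamic INT8 quantization of linear layers are stand-ins, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def quantize_selected(model, include=("backbone", "neck")):
    """Apply dynamic INT8 post-training quantization only to the named
    top-level submodules, leaving the rest (e.g. attention blocks,
    which the paper reports as precision-sensitive) in FP32.
    (Illustrative sketch; module names and the dynamic-quantization
    choice are assumptions.)
    """
    for name, child in model.named_children():
        if name in include:
            quantized = torch.ao.quantization.quantize_dynamic(
                child, {nn.Linear}, dtype=torch.qint8
            )
            setattr(model, name, quantized)
    return model
```

In practice one would profile each quantized configuration for both latency and identity stability (e.g. with an AvgTrackDur-style metric), since the paper observes that accuracy-neutral quantization can still perturb identity propagation.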
Problem

Research questions and friction points this paper is trying to address.

multi-camera 3D tracking
occlusion
heterogeneous viewpoints
identity stability
model optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse4D
multi-camera 3D tracking
post-training quantization
mixed-precision fine-tuning
identity stability
Ethan Anderson
Clemson University
Justin Silva
Clemson University
Kyle Zheng
Clemson University
Sameer Pusegaonkar
NVIDIA
Yizhou Wang
NVIDIA; University of Washington
Computer Vision · Deep Learning · Autonomous Driving
Zheng Tang
Senior Deep Learning Engineer, NVIDIA
Computer Vision · Machine Learning · Deep Learning · Artificial Intelligence
Sujit Biswas
NVIDIA