A Unified 3D Object Perception Framework for Real-Time Outside-In Multi-Camera Systems

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of 3D perception in infrastructure-based “outside-in” multi-camera systems, where heterogeneous camera layouts and extreme occlusions severely degrade performance. To this end, the authors propose a unified 3D perception framework built upon the Sparse4D architecture, integrating geometric priors in world coordinates with occlusion-aware ReID embeddings. They further enhance appearance invariance through NVIDIA COSMOS–based generative Sim2Real augmentation, eliminating the need for manual annotations. An efficient TensorRT plugin leveraging Multi-Scale Deformable Aggregation (MSDA) is developed to enable high-speed inference. The proposed method achieves state-of-the-art performance on the AI City Challenge 2025 with an HOTA score of 45.22, delivering a 2.15× speedup in inference throughput and supporting concurrent processing of over 64 camera streams on a single Blackwell GPU.
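
The MSDA plugin is the throughput-critical step: for each object query it samples multi-scale image features at a small set of projected points and fuses them with learned weights. The sketch below is a minimal, framework-free illustration of that operation; the function names, shapes, and normalized weights are assumptions for illustration, not the authors' TensorRT implementation.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample a (C, H, W) feature map at a continuous pixel location (x, y)."""
    C, H, W = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    x0, y0 = max(x0, 0), max(y0, 0)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[:, y0, x0] + wx * feat[:, y0, x1]
    bot = (1 - wx) * feat[:, y1, x0] + wx * feat[:, y1, x1]
    return (1 - wy) * top + wy * bot

def deformable_aggregate(feature_maps, sample_points, weights):
    """
    Aggregate multi-scale features for one query (hypothetical shapes).

    feature_maps : list of (C, H_l, W_l) arrays, one per scale.
    sample_points: (P, 2) sampling locations, normalized (x, y) in [0, 1].
    weights      : (P, L) per-point, per-scale weights (assumed already normalized).
    Returns the (C,) aggregated feature for the query.
    """
    out = np.zeros(feature_maps[0].shape[0], dtype=np.float32)
    for p, (nx, ny) in enumerate(sample_points):
        for l, feat in enumerate(feature_maps):
            _, H, W = feat.shape
            out += weights[p, l] * bilinear_sample(feat, nx * (W - 1), ny * (H - 1))
    return out
```

A fused GPU kernel would perform the same sampling and weighted sum for all queries, points, and scales in parallel, which is what the reported speedup targets.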

📝 Abstract
Accurate 3D object perception and multi-target multi-camera (MTMC) tracking are fundamental for the digital transformation of industrial infrastructure. However, transitioning "inside-out" autonomous driving models to "outside-in" static camera networks presents significant challenges due to heterogeneous camera placements and extreme occlusion. In this paper, we present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. Our system leverages absolute world-coordinate geometric priors and introduces an occlusion-aware ReID embedding module to maintain identity stability across distributed sensor networks. To bridge the Sim2Real domain gap without manual labeling, we employ a generative data augmentation strategy using the NVIDIA COSMOS framework, creating diverse environmental styles that enhance the model's appearance invariance. Evaluated on the AI City Challenge 2025 benchmark, our camera-only framework achieves a state-of-the-art HOTA of 45.22. Furthermore, we address real-time deployment constraints by developing an optimized TensorRT plugin for Multi-Scale Deformable Aggregation (MSDA). Our hardware-accelerated implementation achieves a 2.15× speedup on modern GPU architectures, enabling a single Blackwell-class GPU to support over 64 concurrent camera streams.
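
One concrete ingredient behind the "absolute world-coordinate geometric priors" is that every static camera is calibrated against a shared world frame, so a 3D anchor can be projected into each view deterministically. The sketch below shows that projection under a standard pinhole model; the function name and matrix conventions are assumptions, not taken from the paper.

```python
import numpy as np

def project_world_point(p_world, extrinsic, intrinsic):
    """
    Project a world-frame point into one calibrated static camera (illustrative only).

    p_world  : (3,) point in the shared world frame.
    extrinsic: (4, 4) world-to-camera transform for the camera.
    intrinsic: (3, 3) pinhole intrinsic matrix.
    Returns (u, v, depth) in pixels and meters, or None if the point is behind the camera.
    """
    p_h = np.append(p_world, 1.0)      # homogeneous world point
    p_cam = extrinsic @ p_h            # world frame -> camera frame
    if p_cam[2] <= 0:                  # not visible from this camera
        return None
    uvw = intrinsic @ p_cam[:3]        # camera frame -> image plane
    return uvw[0] / uvw[2], uvw[1] / uvw[2], p_cam[2]
```

Repeating this projection across all cameras tells the model which views can observe a given anchor, which is the kind of cue an occlusion-aware embedding module can exploit when weighting per-view appearance features.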
Problem

Research questions and friction points this paper is trying to address.

3D object perception
multi-camera tracking
occlusion
domain gap
real-time deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

outside-in perception
occlusion-aware ReID
generative data augmentation
TensorRT acceleration
Multi-Scale Deformable Aggregation
👥 Authors
Yizhou Wang (NVIDIA; University of Washington): Computer Vision, Deep Learning, Autonomous Driving
S. Pusegaonkar (NVIDIA Corporation)
Yuxing Wang (NVIDIA Corporation)
Anqi Li (NVIDIA Corporation)
Vishal Kumar (NVIDIA Corporation)
Chetan Sethi (NVIDIA Corporation)
G. Aiyer (NVIDIA Corporation)
Yun He (Research Scientist at Meta): data mining, information retrieval and recommender systems
Kartikay Thakkar (NVIDIA Corporation)
Swapnil Rathi (NVIDIA Corporation)
Bhushan Rupde (NVIDIA Corporation)
Zheng Tang (Senior Deep Learning Engineer, NVIDIA): Computer Vision, Machine Learning, Deep Learning, Artificial Intelligence
Sujit Biswas (NVIDIA Corporation)