SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding

📅 2025-10-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In urban scene understanding, segmentation deficiencies, interference from dynamic objects, sensor sparsity, and a limited field of view degrade segmentation accuracy and geometric consistency. To address these challenges, this paper proposes SPORTS, an iterative, unified framework that integrates video panoptic segmentation, visual odometry, and neural scene rendering. Its key contributions are: (1) an adaptive attention-based geometric fusion mechanism that aligns cross-frame features guided by pose, depth, and optical flow; (2) a cross-task iterative architecture in which semantic, motion, and geometric information are modeled jointly and reinforce one another; and (3) a learning-based dynamic-object confidence estimate coupled with point-based rendering that turns sparse point clouds into neural fields. Evaluated on three public datasets, the method outperforms most existing state-of-the-art methods on odometry, tracking, segmentation, and novel view synthesis.

📝 Abstract

Scene perception, understanding, and simulation are fundamental techniques for embodied-AI agents, yet existing solutions remain prone to segmentation deficiency, interference from dynamic objects, sensor data sparsity, and view limitations. This paper proposes a novel framework, named SPORTS, for holistic scene understanding that tightly integrates the Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR) tasks into an iterative and unified perspective. First, VPS uses an adaptive attention-based geometric fusion mechanism to align cross-frame features by incorporating the pose, depth, and optical flow modalities, automatically adjusting the feature maps for different decoding stages. A post-matching strategy is also integrated to improve identity tracking. In VO, the panoptic segmentation results from VPS are combined with the optical flow map to improve the confidence estimation of dynamic objects, which enhances the accuracy of camera pose estimation and the completeness of depth map generation via a learning-based paradigm. Furthermore, the point-based rendering in SR benefits from VO, transforming sparse point clouds into neural fields to synthesize high-fidelity RGB views and twin panoptic views. Extensive experiments on three public datasets demonstrate that our attention-based feature fusion outperforms most existing state-of-the-art methods on the odometry, tracking, segmentation, and novel view synthesis tasks.
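
To make the idea of attention-based fusion over the pose, depth, and optical flow modalities more concrete, here is a minimal Python sketch. It is not the SPORTS implementation: the abstract does not specify the attention form, so the sketch assumes plain scaled dot-product attention at a single feature location, and the names `softmax` and `fuse_features` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(query, modality_features):
    """Fuse per-modality feature vectors with attention weights.

    query: feature vector of the current frame at one location (assumed input).
    modality_features: dict mapping a modality name (e.g. "pose", "depth",
        "flow") to a feature vector of the same length as `query`.
    Returns (weights, fused), where `weights` maps each modality to its
    attention weight and `fused` is the weighted sum of the modality vectors.
    """
    names = list(modality_features)
    scale = math.sqrt(len(query))
    # Scaled dot product between the query and each modality feature.
    scores = [
        sum(q * k for q, k in zip(query, modality_features[name])) / scale
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality vectors, one output channel at a time.
    fused = [
        sum(w * modality_features[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Toy example: the pose feature is most aligned with the query.
weights, fused = fuse_features(
    [1.0, 0.0],
    {"pose": [1.0, 0.0], "depth": [0.0, 1.0], "flow": [-1.0, 0.0]},
)
```

In this toy setup the modality whose feature best matches the query receives the largest weight, which is one simple way a fusion step could adapt per location. A real model would learn the query, key, and value projections and operate on full feature maps.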

Problem

Research questions and friction points this paper is trying to address.

Integrates panoptic segmentation with odometry to handle dynamic objects

Develops adaptive feature fusion for cross-frame alignment in urban scenes

Transforms sparse point clouds into neural fields for view synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive attention-based fusion aligns cross-frame features

Panoptic segmentation improves dynamic object confidence estimation

Point-based rendering transforms sparse clouds into neural fields
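
The second point above, using panoptic labels together with optical flow to judge which pixels belong to moving objects, can be illustrated with a small Python sketch. This is an assumption-laden illustration, not the paper's learned module: the label set `MOVABLE_CLASSES`, the function `static_confidence`, the Gaussian weighting, and the `sigma` parameter are all hypothetical. The idea is to compare the observed optical flow with the flow that camera motion alone would induce (the rigid flow) and to down-weight pixels on movable classes where the two disagree.

```python
import math

# Hypothetical set of panoptic classes that are able to move.
MOVABLE_CLASSES = {"car", "person", "bicycle"}


def static_confidence(label, observed_flow, rigid_flow, sigma=1.0):
    """Return a weight in (0, 1]; higher means the pixel is more likely static.

    label: panoptic class name of the pixel (assumed input).
    observed_flow: (u, v) optical flow measured at the pixel.
    rigid_flow: (u, v) flow predicted from camera pose and depth alone.
    sigma: assumed tolerance, in pixels, for the flow residual.
    """
    if label not in MOVABLE_CLASSES:
        # Classes such as road or building are treated as static outright.
        return 1.0
    du = observed_flow[0] - rigid_flow[0]
    dv = observed_flow[1] - rigid_flow[1]
    residual = math.hypot(du, dv)
    # Gaussian falloff: a large residual means the object moves on its own.
    return math.exp(-(residual ** 2) / (2.0 * sigma ** 2))


# A parked car agrees with the rigid flow; a moving car does not.
parked = static_confidence("car", (2.0, 0.0), (2.0, 0.0))
moving = static_confidence("car", (5.0, 4.0), (2.0, 0.0))
```

Weights like these could scale each pixel's contribution to a pose-estimation residual, so a parked car still constrains the camera pose while a moving one is largely ignored.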

🔎 Similar Papers

LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry