Open-Vocabulary Spatio-Temporal Scene Graph for Robot Perception and Teleoperation Planning

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address perception-action misalignment caused by communication latency in dynamic teleoperation, this paper proposes the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a unified representation that jointly models 3D semantic perception and temporal dynamics. It introduces a latency-aware annotation mechanism that lets the planner retrieve historical scene states, and it associates objects across time by Hungarian assignment over a temporal matching cost. A task-oriented subgraph filtering strategy reduces the planner's input to task-relevant content, improving planning efficiency. ST-OVSG uses a Large Vision-Language Model (LVLM) to construct open-vocabulary 3D object representations, which lets it generalize to novel categories without fine-tuning. On the Replica benchmark, ST-OVSG achieves 74% node accuracy, surpassing ConceptGraph. In the latency-robustness experiment, the LVLM planner assisted by ST-OVSG reaches a planning success rate of 70.5%.
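The cross-temporal association step can be illustrated with a minimal sketch. The summary does not give the paper's actual temporal matching cost, so the cost below (a weighted sum of centroid distance and embedding dissimilarity), the weights, the `max_cost` gate, and all function and field names are assumptions made for illustration only. The assignment itself uses a standard O(n^3) Hungarian (Kuhn-Munkres) routine written in pure Python.

```python
import math

def matching_cost(prev, curr, w_pos=0.5, w_sem=0.5):
    """Hypothetical cost: centroid distance plus (1 - cosine similarity)."""
    dist = math.dist(prev["centroid"], curr["centroid"])
    dot = sum(a * b for a, b in zip(prev["embedding"], curr["embedding"]))
    norm = math.hypot(*prev["embedding"]) * math.hypot(*curr["embedding"])
    cos_sim = dot / norm if norm else 0.0
    return w_pos * dist + w_sem * (1.0 - cos_sim)

def hungarian(cost):
    """Minimum-cost assignment for a square matrix.
    Returns a list where result[row] is the column assigned to that row."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-based)
    v = [0.0] * (n + 1)          # column potentials (1-based)
    p = [0] * (n + 1)            # p[j]: row currently matched to column j
    way = [0] * (n + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment

def associate(prev_nodes, curr_nodes, max_cost=1.0):
    """Match objects between two frames; pairs above max_cost stay unmatched."""
    n = max(len(prev_nodes), len(curr_nodes))
    pad = 1e6                    # large finite cost for dummy rows and columns
    cost = [[pad] * n for _ in range(n)]
    for i, a in enumerate(prev_nodes):
        for j, b in enumerate(curr_nodes):
            cost[i][j] = matching_cost(a, b)
    matches = []
    for i, j in enumerate(hungarian(cost)):
        if i < len(prev_nodes) and j < len(curr_nodes) and cost[i][j] <= max_cost:
            matches.append((prev_nodes[i]["id"], curr_nodes[j]["id"]))
    return matches

# Toy frames: two known objects move slightly and a new object appears.
prev_frame = [
    {"id": "cup_t0", "centroid": (0.0, 0.0, 0.0), "embedding": (1.0, 0.0)},
    {"id": "chair_t0", "centroid": (2.0, 0.0, 0.0), "embedding": (0.0, 1.0)},
]
curr_frame = [
    {"id": "chair_t1", "centroid": (2.1, 0.0, 0.0), "embedding": (0.0, 1.0)},
    {"id": "cup_t1", "centroid": (0.1, 0.0, 0.0), "embedding": (1.0, 0.0)},
    {"id": "book_t1", "centroid": (5.0, 5.0, 0.0), "embedding": (0.7, 0.7)},
]
matches = associate(prev_frame, curr_frame)
```

Padding the cost matrix to a square lets the same routine handle frames with different object counts: an object assigned to a dummy row or column, or matched above `max_cost`, is treated as newly appeared or no longer visible.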

📝 Abstract
Teleoperation via natural language reduces operator workload and enhances safety in high-risk or remote settings. However, in dynamic remote scenes, transmission latency during bidirectional communication creates gaps between remotely perceived states and operator intent, leading to command misunderstanding and incorrect execution. To mitigate this, we introduce the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. ST-OVSG leverages LVLMs to construct open-vocabulary 3D object representations, and extends them into the temporal domain via Hungarian assignment with our temporal matching cost, yielding a unified spatio-temporal scene graph. A latency tag is embedded to enable LVLM planners to retrospectively query past scene states, thereby resolving local-remote state mismatches caused by transmission delays. To further reduce redundancy and highlight task-relevant cues, we propose a task-oriented subgraph filtering strategy that produces compact inputs for the planner. ST-OVSG generalizes to novel categories and enhances planning robustness against transmission latency without requiring fine-tuning. Experiments show that our method achieves 74 percent node accuracy on the Replica benchmark, outperforming ConceptGraph. Notably, in the latency-robustness experiment, the LVLM planner assisted by ST-OVSG achieved a planning success rate of 70.5 percent.
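The idea of using a latency tag to query past scene states can be sketched as a timestamped snapshot store. The abstract does not describe how the tag is encoded or used, so the `SceneHistory` class, the `state_seen_by_operator` helper, and the interpretation of the tag as the total delay between the snapshot the operator was viewing and the arrival of the command are all hypothetical.

```python
import bisect

class SceneHistory:
    """Hypothetical store of scene-graph snapshots indexed by capture time."""

    def __init__(self):
        self._times = []
        self._graphs = []

    def add(self, timestamp, graph):
        # Snapshots are assumed to be added in chronological order.
        self._times.append(timestamp)
        self._graphs.append(graph)

    def state_at(self, t):
        """Return the latest snapshot captured at or before time t."""
        idx = bisect.bisect_right(self._times, t) - 1
        if idx < 0:
            raise LookupError("no snapshot at or before the requested time")
        return self._graphs[idx]

def state_seen_by_operator(history, command_arrival_time, latency_tag):
    """Look up the scene the operator was reacting to when issuing a command.
    latency_tag is assumed to be the total delay, in seconds, between that
    scene being captured and the command reaching the robot."""
    return history.state_at(command_arrival_time - latency_tag)

# Toy history: the cup moves while the command is in flight.
history = SceneHistory()
history.add(0.0, {"cup": "table"})
history.add(1.0, {"cup": "shelf"})
history.add(2.0, {"cup": "sink"})

# A command arrives at t = 2.3 s tagged with 1.0 s of delay, so the operator
# was looking at the scene as it stood at t = 1.3 s.
referenced = state_seen_by_operator(history, command_arrival_time=2.3, latency_tag=1.0)
current = history.state_at(2.3)
```

Comparing `referenced` with `current` exposes the local-remote mismatch: the operator's command refers to the cup on the shelf, while the robot currently observes it in the sink.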
Problem

Research questions and friction points this paper is trying to address.

Resolving command misunderstanding from transmission latency gaps
Mitigating local-remote state mismatches caused by communication delays
Enhancing robot planning robustness against dynamic scene latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-vocabulary 3D object representations using LVLMs
Temporal extension via Hungarian assignment with matching cost
Latency tags and task-oriented subgraph filtering for planners
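Task-oriented subgraph filtering can be illustrated with a small sketch. The page does not describe the paper's filtering criteria, so plain keyword matching against node labels and a one-hop neighbour expansion stand in for them here; the `filter_subgraph` function and the graph layout (a dict of node labels plus a list of relation triples) are assumptions.

```python
def filter_subgraph(nodes, edges, task, hops=1):
    """Keep nodes whose label appears in the task text, plus their neighbours.

    nodes: dict mapping node id -> label
    edges: list of (source id, relation, target id) triples
    """
    words = set(task.lower().replace(",", " ").split())
    keep = {nid for nid, label in nodes.items() if label.lower() in words}
    for _ in range(hops):
        frontier = set()
        for src, _rel, dst in edges:
            if src in keep:
                frontier.add(dst)
            if dst in keep:
                frontier.add(src)
        keep |= frontier
    sub_nodes = {nid: nodes[nid] for nid in keep}
    sub_edges = [(s, r, d) for s, r, d in edges if s in keep and d in keep]
    return sub_nodes, sub_edges

# Toy scene graph with one task-relevant object and unrelated furniture.
nodes = {1: "cup", 2: "table", 3: "sofa", 4: "lamp"}
edges = [(1, "on", 2), (3, "near", 4)]
sub_nodes, sub_edges = filter_subgraph(nodes, edges, "pick up the cup")
```

Only the cup and the table it rests on are passed to the planner in this example, which is the kind of compact, task-relevant input the filtering step is meant to produce.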
Yi Wang
HI-Robot Lab, School of Automation Science and Engineering, South China University of Technology
Zeyu Xue
HI-Robot Lab, School of Automation Science and Engineering, South China University of Technology
Mujie Liu
Federation University Australia
Graph Learning, Brain Network Analysis, Time Series Anomaly Detection
Tongqin Zhang
HI-Robot Lab, School of Automation Science and Engineering, South China University of Technology
Yan Hu
Institute of AI Industries, Chinese Academy of Sciences, 211135, China
Zhou Zhao
Zhejiang University
Machine Learning, Data Mining, Multimedia Computing
Chenguang Yang
Chair Professor in Robotics, Fellow of IEEE, IET, IMechE, AIAA, BCS
Robotics
Zhenyu Lu
HI-Robot Lab, School of Automation Science and Engineering, South China University of Technology