Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses real-time, semantically rich 3D scene graph construction on resource-constrained agile robots using only monocular RGB and IMU inputs. The authors propose Mono-Hydra++, a pipeline for real-time metric-semantic mapping and hierarchical 3D scene graph generation on lightweight platforms without active depth sensors. It integrates a DINOv3-based multi-task model for depth and semantics (M2H-MX), a deep-feature visual-inertial odometry front end, pose-aware temporal alignment, sparse predicted-depth constraints in the pose graph, and semantic masking of dynamic regions; the perception model is deployed via ONNX/TensorRT. In the reported experiments, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline on the Go-SLAM ScanNet evaluation subset and improves average ATE by 29.8% over the strongest competing calibrated baseline on calibrated 7-Scenes, while the FP16 M2H-MX-L perception model runs at 25.53 FPS on a Jetson Orin NX 16GB.
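To make the reported percentages concrete, the sketch below shows one way an absolute trajectory error (ATE) and a relative improvement could be computed. It is an illustration only, not the paper's evaluation code: the function names `ate_rmse` and `relative_improvement` are hypothetical, and the alignment is a simplified translation-only (centroid) alignment, whereas trajectory benchmarks typically apply a full rigid-body or similarity alignment before measuring error.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position error after translation-only (centroid) alignment.

    Both inputs are equally long sequences of (x, y, z) positions that are
    assumed to be already associated by timestamp. Rotation and scale
    alignment are deliberately omitted in this simplified sketch.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and equally long")
    n = len(estimated)
    # Centroid of each trajectory, used to remove the constant offset.
    est_c = [sum(p[i] for p in estimated) / n for i in range(3)]
    gt_c = [sum(p[i] for p in ground_truth) / n for i in range(3)]
    squared_error = 0.0
    for e, g in zip(estimated, ground_truth):
        squared_error += sum(
            ((e[i] - est_c[i]) - (g[i] - gt_c[i])) ** 2 for i in range(3)
        )
    return math.sqrt(squared_error / n)


def relative_improvement(baseline_error, method_error):
    """Percentage by which method_error is lower than baseline_error."""
    return 100.0 * (baseline_error - method_error) / baseline_error
```

With made-up inputs, a trajectory that differs from the ground truth only by a constant offset yields an ATE of zero, and halving a baseline error corresponds to a 50% relative improvement.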

📝 Abstract

Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human-robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object-level and room-level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real-time monocular RGB plus IMU pipeline for indoor metric-semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3-based multi-task model for depth and semantics, with a deep-feature visual-inertial odometry front end, sparse predicted-depth constraints in the VIO-derived pose graph, semantic masking for dynamic regions, and pose-aware temporal alignment before volumetric fusion in the Mono-Hydra backend. On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline in our comparison, while using only monocular RGB plus IMU input. On calibrated 7-Scenes, it improves average ATE by 29.8% over the strongest competing calibrated baseline. We further validate Mono-Hydra++ in a real ITC building deployment using RealSense RGB plus IMU and demonstrate embedded feasibility by deploying the ONNX/TensorRT FP16 M2H-MX-L perception model at 25.53 FPS on a Jetson Orin NX 16GB. These results show that Mono-Hydra++ can provide real-time metric-semantic mapping and scene graph construction for resource-constrained robotic platforms without relying on active depth sensors.
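The abstract mentions semantic masking for dynamic regions but does not describe how it is done. The sketch below illustrates the general idea under stated assumptions: per-pixel class labels are available alongside the predicted depth, a depth of 0.0 marks a pixel as invalid so that it can be skipped during fusion, and `person` is the only dynamic class. The names `DYNAMIC_CLASSES` and `mask_dynamic_depth` are hypothetical and are not taken from Mono-Hydra++.

```python
# Hypothetical set of classes treated as dynamic; the paper's list is not given here.
DYNAMIC_CLASSES = frozenset({"person"})


def mask_dynamic_depth(depth, labels, dynamic_classes=DYNAMIC_CLASSES, invalid=0.0):
    """Return a copy of `depth` with pixels of dynamic classes set to `invalid`.

    `depth` is a list of rows of floats (metres) and `labels` is a list of rows
    of class-name strings with the same shape. Using 0.0 as the invalid marker
    is an assumption of this sketch.
    """
    if len(depth) != len(labels):
        raise ValueError("depth and labels must have the same number of rows")
    masked = []
    for depth_row, label_row in zip(depth, labels):
        if len(depth_row) != len(label_row):
            raise ValueError("depth and labels rows must have the same length")
        masked.append(
            [
                invalid if label in dynamic_classes else value
                for value, label in zip(depth_row, label_row)
            ]
        )
    return masked


depth_map = [[1.0, 2.0], [3.0, 4.0]]
label_map = [["wall", "person"], ["floor", "chair"]]
static_depth = mask_dynamic_depth(depth_map, label_map)
```

Here only the pixel labelled `person` is invalidated, so a downstream fusion step that ignores invalid depth would not integrate the moving person into the map.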

Problem

Research questions and friction points this paper is trying to address.

monocular scene graph

3D indoor mapping

semantic mapping

resource-constrained robotics

real-time perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

monocular scene graph

multi-task learning

visual-inertial odometry