M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding

📅 2026-03-31
🤖 AI Summary
This work tackles the challenge of achieving high depth and semantic accuracy while maintaining system stability in real-time dense spatial perception and mapping with monocular cameras. To this end, we propose M2H-MX, a novel model that preserves multi-scale features within a lightweight decoder and introduces a register-gated global context mechanism alongside controlled cross-task interaction, so that depth and semantic predictions mutually enhance each other under low-latency constraints. The method integrates seamlessly into standard monocular SLAM pipelines via a compact interface. As the first approach to efficiently embed multi-task dense prediction into real-time monocular SLAM, M2H-MX achieves a 6.6% improvement in semantic mIoU and a 9.4% reduction in depth RMSE on NYUDv2, and reduces mapping trajectory error by 60.7% on ScanNet, yielding clearer and more consistent metric-semantic maps.
📝 Abstract
Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
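The abstract describes a register-gated global context mechanism in the decoder but does not give its equations. A minimal sketch of what such a mechanism could look like is below, under the assumption (not stated in the paper) that a small set of learned register tokens summarizes global context, pixel tokens attend to them, and a sigmoid gate controls how much of that context each pixel admits; all names (`register_gated_context`, `Wg`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def register_gated_context(tokens, registers, Wg):
    """Hypothetical sketch of a register-gated global context step.

    tokens:    (N, C) per-pixel decoder features
    registers: (R, C) learned register tokens (R << N) carrying global context
    Wg:        (C, C) learned gate projection
    """
    # cross-attention from pixel tokens to the small register set
    attn = softmax(tokens @ registers.T / np.sqrt(tokens.shape[1]), axis=-1)  # (N, R)
    context = attn @ registers                                               # (N, C)
    # sigmoid gate decides, per pixel and channel, how much context to admit
    gate = 1.0 / (1.0 + np.exp(-(tokens @ Wg)))                              # (N, C)
    return tokens + gate * context
```

Because attention is computed against only R register tokens rather than all N pixels, the cost is O(N·R·C) instead of O(N²·C), which is consistent with the paper's low-latency framing.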
Problem

Research questions and friction points this paper is trying to address.

monocular spatial understanding
multi-task dense prediction
real-time perception
monocular SLAM
depth and semantic estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-task dense prediction
monocular spatial understanding
register-gated global context
cross-task interaction
real-time SLAM integration
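The paper states only that M2H-MX feeds an unmodified monocular SLAM pipeline through a "compact perception-to-mapping interface"; the concrete fields are not specified. A hedged sketch of what such a per-frame packet might carry is below; the class name and fields are assumptions, not the authors' API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PerceptionPacket:
    """Hypothetical per-frame interface from multi-task perception to SLAM."""
    timestamp: float
    depth: np.ndarray       # (H, W) metric depth in meters
    semantics: np.ndarray   # (H, W) integer class labels
    confidence: np.ndarray  # (H, W) per-pixel confidence in [0, 1]

    def valid_mask(self, min_conf: float = 0.5) -> np.ndarray:
        # pixels trusted enough for metric-semantic mapping:
        # finite depth and sufficiently confident prediction
        return np.isfinite(self.depth) & (self.confidence >= min_conf)
```

Keeping the interface this small is what lets the SLAM back end stay unmodified: it consumes dense depth and labels exactly as it would consume sensor input.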
U. V. B. L. Udugama
Department of Earth Observation Science, University of Twente, 7522 NH Enschede, The Netherlands
George Vosselman
University of Twente
photogrammetry, laser scanning
Francesco Nex
University of Twente - ITC Faculty
photogrammetry, remote sensing, computer vision, UAV, drones