DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robust robotic manipulation under visual degradation (e.g., low-light conditions) remains a critical challenge. To address this, we propose a multimodal enhancement framework for vision-language understanding: first, we synthesize high-fidelity RGB images from LiDAR data using a conditional generative adversarial network coupled with a refinement network; second, we introduce a luminance-aware modality adaptation mechanism that dynamically fuses real and synthetic RGB images—without fine-tuning downstream vision-language models—to achieve illumination-invariant feature alignment. Our approach is compatible with mainstream vision-language models and is validated on both real-world and photorealistic multi-illumination datasets. It significantly outperforms RGB-only baselines, improving perception accuracy and task safety in low-light scenarios. The method provides a deployable, robust multimodal understanding solution for safety-critical robotic systems.

📝 Abstract
Ensuring reliable robot operation when visual input is degraded or insufficient remains a central challenge in robotics. This letter introduces DepthVision, a framework for multimodal scene understanding designed to address this problem. Unlike existing Vision-Language Models (VLMs), which use only camera-based visual input alongside language, DepthVision synthesizes RGB images from sparse LiDAR point clouds using a conditional generative adversarial network (GAN) with an integrated refiner network. These synthetic views are then combined with real RGB data using Luminance-Aware Modality Adaptation (LAMA), a mechanism that blends the two types of data dynamically based on ambient lighting conditions. This approach compensates for sensor degradation, such as darkness or motion blur, without requiring any fine-tuning of downstream vision-language models. We evaluate DepthVision on real and simulated datasets across various models and tasks, with particular attention to safety-critical tasks. The results demonstrate that our approach improves performance in low-light conditions, achieving substantial gains over RGB-only baselines while preserving compatibility with frozen VLMs. This work highlights the potential of LiDAR-guided RGB synthesis for achieving robust robot operation in real-world environments.
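The abstract's synthesis stage pairs a conditional generator (LiDAR depth in, RGB out) with a refiner network. The paper does not publish its architecture here, so the following is a minimal illustrative sketch in PyTorch; the layer sizes, the single-channel projected-depth input, and the residual refiner design are all assumptions, not the authors' actual networks.

```python
import torch
import torch.nn as nn

class LidarToRGBGenerator(nn.Module):
    """Toy conditional generator: a projected LiDAR depth map (1 channel)
    is mapped to a coarse RGB image (3 channels). Illustrative only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # Encoder: downsample the sparse depth map twice
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            # Decoder: upsample back to input resolution as RGB
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, depth):
        return self.net(depth)

class Refiner(nn.Module):
    """Toy residual refiner: cleans up the coarse synthetic RGB."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, rgb):
        # Predict a residual correction and re-squash to [-1, 1]
        return torch.tanh(rgb + self.net(rgb))

# Usage: depth map in, refined synthetic RGB of the same spatial size out
generator, refiner = LidarToRGBGenerator(), Refiner()
synthetic = refiner(generator(torch.zeros(1, 1, 64, 64)))
```

In an adversarial setup such as the one described, the generator and refiner would be trained against a discriminator that judges real versus synthesized RGB; that training loop is omitted here.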
Problem

Research questions and friction points this paper is trying to address.

Addressing robot vision degradation in low-light conditions
Synthesizing RGB images from LiDAR using GANs
Enhancing multimodal understanding without VLM fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

GAN-based LiDAR-to-RGB image synthesis
Luminance-aware dynamic data fusion
No fine-tuning required for VLMs
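The luminance-aware fusion idea above can be sketched as a simple weighting of real and synthetic frames by ambient brightness. This is a minimal NumPy sketch under stated assumptions: the Rec. 601 luma formula, a linear ramp between hypothetical thresholds `low` and `high`, and a global (per-frame) weight are all illustrative choices, not the paper's LAMA mechanism.

```python
import numpy as np

def luminance_weight(real_rgb, low=40.0, high=120.0):
    """Map the mean luminance of the real camera frame to a fusion
    weight in [0, 1]. Dark scenes push the weight toward 0 (synthetic);
    `low` and `high` are hypothetical thresholds on 8-bit luma."""
    # Rec. 601 luma from a uint8 RGB frame
    luma = (0.299 * real_rgb[..., 0]
            + 0.587 * real_rgb[..., 1]
            + 0.114 * real_rgb[..., 2])
    # Linear ramp: fully synthetic below `low`, fully real above `high`
    return float(np.clip((luma.mean() - low) / (high - low), 0.0, 1.0))

def fuse(real_rgb, synthetic_rgb):
    """Blend the real and LiDAR-synthesized RGB frames by luminance."""
    w = luminance_weight(real_rgb)
    fused = (w * real_rgb.astype(np.float32)
             + (1.0 - w) * synthetic_rgb.astype(np.float32))
    return fused.astype(np.uint8)
```

Because the fusion happens in image space before the VLM sees the frame, the downstream model stays frozen, which matches the "no fine-tuning" property claimed above.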
Sven Kirchner
Technical University of Munich
Nils Purschke
Chair of Robotics, Artificial Intelligence and Real-time Systems, Technical University of Munich, 80333 Munich, Germany
Ross Greer
University of California Merced
Alois C. Knoll
Chair of Robotics, Artificial Intelligence and Real-time Systems, Technical University of Munich, 80333 Munich, Germany