Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models, which rely solely on RGB inputs and lack explicit safety mechanisms, leading to failures in out-of-distribution scenarios. To overcome these challenges, the study introduces long-wave infrared (LWIR) thermal imaging into the VLA framework for the first time, enabling semantic reasoning grounded in thermodynamic properties through integration with pretrained vision-language models. Furthermore, safety constraints are formally embedded into the end-to-end policy via control barrier functions. Experiments on a Franka robotic arm demonstrate that the proposed approach significantly outperforms RGB-based baselines in tasks involving temperature-sensitive manipulation, subsurface object localization, and resolution of reflective ambiguities, while guaranteeing safety during inference in unstructured environments.

📝 Abstract
Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors. Moreover, end-to-end generative policies lack explicit safety constraints, making them fragile when encountering obstacles and novel scenarios outside the training distribution. To address these limitations, we propose Safe-Night VLA, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. Specifically, Safe-Night VLA integrates long-wave infrared thermal perception into a pre-trained vision-language backbone, enabling semantic reasoning grounded in thermodynamic properties. To ensure safe execution under out-of-distribution conditions, we incorporate a safety filter via control barrier functions, which provide deterministic workspace constraint enforcement during policy execution. We validate our framework through real-world experiments on a Franka manipulator, introducing a novel evaluation paradigm featuring temperature-conditioned manipulation, subsurface target localization, and reflection disambiguation, while maintaining constrained execution at inference time. Results demonstrate that Safe-Night VLA outperforms RGB-only baselines and provide empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.
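The abstract describes a safety filter based on control barrier functions (CBFs) that deterministically enforces workspace constraints on the learned policy's actions at inference time. The paper's exact formulation is not given here, but a minimal generic sketch of such a filter, assuming single-integrator end-effector dynamics and a spherical keep-out region (all names and parameters are illustrative, not from the paper), looks like:

```python
import numpy as np

def cbf_safety_filter(x, u_nom, obstacle, radius, alpha=5.0):
    """Minimally modify the policy's nominal action u_nom so the
    end-effector position x (single-integrator model: x_dot = u)
    stays outside a spherical keep-out region.

    Barrier: h(x) = ||x - obstacle||^2 - radius^2  (safe iff h >= 0)
    CBF condition: grad_h(x) . u + alpha * h(x) >= 0
    """
    d = x - obstacle
    h = d @ d - radius ** 2          # barrier value
    grad_h = 2.0 * d                 # gradient of h at x
    margin = grad_h @ u_nom + alpha * h
    if margin >= 0.0:
        return u_nom                 # nominal action already satisfies the CBF
    # Closed-form solution of the QP: min ||u - u_nom||^2
    # s.t. grad_h . u + alpha * h >= 0  (single affine constraint)
    lam = -margin / (grad_h @ grad_h)
    return u_nom + lam * grad_h
```

With a single affine constraint the quadratic program has this closed-form projection, so no QP solver is needed; with multiple barriers (e.g. several obstacles plus joint limits) one would instead solve the QP at each control step.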
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
thermal perception
safety constraints
out-of-distribution robustness
multimodal manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

thermal perception
vision-language-action models
control barrier functions
safety-critical manipulation
multimodal robotics