Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of vision-language-action models to safety threats when executing multimodal instructions, a challenge exacerbated by the limitations of existing defenses that often suffer from delayed intervention or modality misalignment. To overcome these issues, we propose a concept-based dictionary learning framework that, during inference, identifies and suppresses harmful concept directions through sparse, interpretable activation dictionaries, enabling real-time blocking of unsafe behaviors. Our approach achieves the first inference-time, concept-level safety control for embodied systems, offering interpretability, plug-and-play deployment, and model-agnostic applicability without requiring retraining. Evaluated on benchmarks including Libero-Harm, BadRobot, RoboPair, and IS-Bench, the method reduces attack success rates by over 70% while maintaining high task completion performance.

📝 Abstract
Vision Language Action (VLA) models close the perception-action loop by translating multimodal instructions into executable behaviors, but this very capability magnifies safety risks: jailbreaks that merely yield toxic text in LLMs can trigger unsafe physical actions in embodied systems. Existing defenses (alignment, filtering, or prompt hardening) intervene too late or at the wrong modality, leaving fused representations exploitable. We introduce a concept-based dictionary learning framework for inference-time safety control. By constructing sparse, interpretable dictionaries from hidden activations, our method identifies harmful concept directions and applies threshold-based interventions to suppress or block unsafe activations. Experiments on Libero-Harm, BadRobot, RoboPair, and IS-Bench show that our approach achieves state-of-the-art defense performance, cutting attack success rates by over 70% while maintaining task success. Crucially, the framework is plug-in and model-agnostic, requiring no retraining and integrating seamlessly with diverse VLAs. To our knowledge, this is the first inference-time concept-based safety method for embodied systems, advancing both interpretability and safe deployment of VLA models.
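The intervention the abstract describes (sparse-code a hidden activation against a learned dictionary, then zero out the coefficients of atoms flagged as harmful before reconstructing) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dictionary `D`, the `harmful_atoms` index set, the threshold `tau`, and the ISTA sparse-coding solver are all assumptions for demonstration.

```python
import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=50):
    """Sparse-code activation x against dictionary D via ISTA:
    minimize 0.5 * ||x - D @ c||^2 + lam * ||c||_1."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - x)          # gradient of the quadratic term
        c = c - grad / L                   # gradient step
        c = np.sign(c) * np.maximum(np.abs(c) - lam / L, 0.0)  # soft-threshold
    return c

def safe_intervene(x, D, harmful_atoms, tau=0.05):
    """Suppress harmful concept directions in an activation at inference time.

    Coefficients on atoms listed in harmful_atoms are zeroed whenever they
    exceed tau in magnitude; the cleaned activation is then reconstructed.
    """
    c = sparse_code(x, D)
    c_safe = c.copy()
    mask = np.zeros_like(c, dtype=bool)
    mask[harmful_atoms] = True
    c_safe[mask & (np.abs(c) > tau)] = 0.0  # block unsafe concept activations
    return D @ c_safe, c

# Toy example: orthonormal dictionary, atom 1 declared harmful (hypothetical).
D = np.eye(4)
x = 1.0 * D[:, 0] + 0.8 * D[:, 1]         # activation mixing two concepts
x_safe, codes = safe_intervene(x, D, harmful_atoms=[1])
```

With an orthonormal dictionary, the soft-threshold shrinks each coefficient by `lam`, so the harmful component (0.8 on atom 1) is detected and removed while the benign component survives, which mirrors the paper's claim of blocking unsafe behaviors without destroying task-relevant signal.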
Problem

Research questions and friction points this paper is trying to address.

Vision Language Action models
inference-time safety
jailbreak attacks
embodied systems
unsafe physical actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

concept-based dictionary learning
inference-time safety
vision language action models
embodied AI safety
interpretable representation