Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of current vision-language foundation models to multimodal attacks that combine low-resource languages with harmful imagery, a weakness stemming from the lack of effective cross-lingual and cross-modal safety mechanisms. The authors propose a two-stage safety alignment framework: first identifying critical “safety neurons” by contrasting activation patterns between harmful and benign inputs, then applying gradient masking to restrict parameter updates exclusively to this neuron subspace. This approach achieves substantial safety improvements with minimal perturbation, altering fewer than 0.03% of parameters. The study further reveals, for the first time, a moderate overlap among safety neurons across languages and modalities, enabling zero-shot transfer. By establishing a neuron-level safety alignment paradigm, the method mitigates structural blind-spot attacks while preserving strong multilingual and multimodal generalization capabilities.
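The first stage, identifying safety neurons by contrasting activations, can be illustrated with a minimal, dependency-free sketch. The scoring rule used here (absolute difference of mean activations, then top-k selection), the function name `find_safety_neurons`, and the toy activation values are all assumptions made for illustration; the summary does not specify the exact criterion the authors use.

```python
from statistics import fmean


def find_safety_neurons(harmful_acts, benign_acts, top_k):
    """Return sorted indices of the top_k neurons whose mean activation
    differs most between harmful and benign inputs.

    harmful_acts / benign_acts: lists of activation vectors, one per input,
    each holding one float per neuron. The scoring rule is an assumption,
    not necessarily the criterion used in the paper.
    """
    n_neurons = len(harmful_acts[0])
    scores = []
    for j in range(n_neurons):
        mean_harmful = fmean(row[j] for row in harmful_acts)
        mean_benign = fmean(row[j] for row in benign_acts)
        scores.append(abs(mean_harmful - mean_benign))
    # Rank neurons by contrast score, highest first, and keep the top_k.
    ranked = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top_k])


# Toy activations for 4 neurons (hypothetical values).
harmful = [[0.9, 0.1, 0.5, 0.2], [1.1, 0.0, 0.4, 0.3]]
benign = [[0.1, 0.1, 0.5, 0.9], [0.0, 0.2, 0.6, 1.0]]

safety_neurons = find_safety_neurons(harmful, benign, top_k=2)
print(safety_neurons)  # [0, 3]
```

In this toy data, neurons 0 and 3 respond very differently to harmful and benign inputs, while neurons 1 and 2 barely change, so only the former are selected.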

📝 Abstract

In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking, affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.
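The second stage, restricting updates to the safety-neuron subspace via gradient masking, can be sketched as a single masked gradient step. This is an illustrative toy under stated assumptions: each row of the weight matrix is treated as the parameters of one neuron, plain SGD is used as the optimizer, and the names `masked_sgd_step` and `safety_rows` are hypothetical. The abstract does not describe the authors' actual training setup at this level of detail.

```python
def masked_sgd_step(weights, grads, safety_rows, lr=0.1):
    """Apply one SGD step, updating only the rows listed in safety_rows.

    weights / grads: lists of rows, one row of floats per neuron.
    Gradients for all other rows are multiplied by a zero mask, so those
    parameters stay exactly as they were (assumed masking scheme).
    """
    allowed = set(safety_rows)
    updated = []
    for i, (w_row, g_row) in enumerate(zip(weights, grads)):
        mask = 1.0 if i in allowed else 0.0
        updated.append([w - lr * mask * g for w, g in zip(w_row, g_row)])
    return updated


# Toy layer with 4 neurons and 2 weights each (hypothetical values).
weights = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[0.5, -0.5], [0.5, -0.5], [0.5, -0.5], [0.5, -0.5]]

new_weights = masked_sgd_step(weights, grads, safety_rows=[0, 3], lr=0.1)

# Count how many individual parameters actually changed.
changed = sum(
    new != old
    for new_row, old_row in zip(new_weights, weights)
    for new, old in zip(new_row, old_row)
)
fraction_changed = changed / sum(len(row) for row in weights)
print(fraction_changed)  # 0.5
```

The toy layer changes half of its parameters only because it has four neurons in total; in the setting the abstract describes, the selected subspace covers fewer than 0.03% of the model's parameters.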

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Large Models

multilingual safety

multimodal attacks

cross-lingual alignment

safety mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

safety neurons

gradient masking

cross-lingual alignment