🤖 AI Summary
This work addresses the problem of achieving fine-grained instruction following in vision-language models (VLMs) without modifying their pretrained weights. We propose a lightweight activation-guidance module that dynamically modulates semantic interactions between visual and linguistic modalities via dimension-wise activation modulation and cross-layer adaptive guidance—requiring no predefined intervention layers or static control vectors, and introducing only 0.14% additional parameters. Our method learns latent-space embeddings for target and counterfactual behaviors using a novel multimodal dataset, VNIA, specifically curated for training and evaluation. Experiments demonstrate that our approach significantly outperforms existing intervention methods on instruction-following and hallucination suppression tasks, while preserving performance on non-target tasks. This validates activation engineering as an effective paradigm for controllable multimodal reasoning.
📝 Abstract
This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) toward outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust the activations connecting the language modality with image context. This allows fine-grained, inference-time control over complex output semantics without modifying model weights, while preserving performance on off-target tasks. The steering module requires learned parameters equal to only 0.14% of the original VLM's size, and it gains model control through dimension-wise activation modulation and adaptive steering across layers, without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset created specifically to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination-mitigation benchmarks for VLMs and offers a robust solution for multimodal model control through activation engineering.
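The core mechanism described above can be sketched in a few lines: derive a steering direction from the latent embeddings of a target/converse prompt pair, then add it to a layer's hidden activation through a learned per-dimension gate and a per-layer adaptive scale. This is a minimal illustrative sketch, not the paper's actual module; the names (`steering_vector`, `steer`, `gate`, `layer_scale`) and the sigmoid gate initialization are assumptions for exposition.

```python
import numpy as np

def steering_vector(target_emb, converse_emb):
    """Steering direction: difference of latent embeddings for the
    target behavior and its converse (illustrative construction)."""
    return target_emb - converse_emb

def steer(h, v, gate, layer_scale):
    """Dimension-wise activation modulation at one layer.

    h           : (d,) hidden activation at this layer
    v           : (d,) steering direction
    gate        : (d,) learned per-dimension modulation in [0, 1]
    layer_scale : scalar adaptive steering strength for this layer
    """
    return h + layer_scale * gate * v

# Toy example with random "embeddings" (hypothetical data, d = 8).
rng = np.random.default_rng(0)
d = 8
target_emb = rng.normal(size=d)
converse_emb = rng.normal(size=d)
h = rng.normal(size=d)

# A sigmoid squashes a learned parameter into a [0, 1] gate.
gate = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))

v = steering_vector(target_emb, converse_emb)
h_steered = steer(h, v, gate, layer_scale=0.5)
```

Because both the gate and the per-layer scale are learned rather than fixed, the same formulation can suppress steering entirely on layers or dimensions where intervention hurts off-target performance (gate or scale near zero leaves `h` unchanged).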